<h1>This exercise predicts whether a user will buy the course based on the pages they visited on the website</h1>

In [54]:
import pandas as pd

data = pd.read_csv('data/tracking.csv')
data.head()

Unnamed: 0,home,how_it_works,contact,bought
0,1,1,0,0
1,1,1,0,0
2,1,1,0,0
3,1,1,0,0
4,1,1,0,0


In [2]:
x = data[["home", "how_it_works", "contact"]]
y = data["bought"]

<h3>Separating the training set and the test set</h3>

In [3]:
data.shape

(99, 4)

In [4]:
# With 99 rows, we'll use the first 75 (about 75%) for training and the remaining 24 (about 25%) for testing

# Rows 0-74
train_x = x[:75]
train_y = y[:75]
# Rows 75-98
test_x = x[75:]
test_y = y[75:]

print("We'll train with %d elements and test with %d elements" % (len(train_x), len(test_x)))

We'll train with 75 elements and test with 24 elements


<h3>Creating a model and testing the accuracy of its predictions</h3>

In [5]:
from sklearn.svm import LinearSVC
from sklearn.metrics import accuracy_score

# Creating the model
model = LinearSVC()
model.fit(train_x, train_y)

# Making predictions
predictions = model.predict(test_x)

# Evaluating the model's accuracy
accuracy = accuracy_score(test_y, predictions) * 100

print("The accuracy was %.2f%%" % accuracy)

The accuracy was 95.83%
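
As a rough check on what this number means: accuracy is the share of test rows whose prediction matches the true label. With 24 test rows, 95.83% would be consistent with 23 correct predictions and one mistake. The sketch below reproduces that arithmetic on hypothetical labels (the arrays `true_labels` and `predicted` are made up for illustration, not taken from the tracking data).

```python
import numpy as np
from sklearn.metrics import accuracy_score

# Hypothetical labels: 24 test rows, not the real tracking data
true_labels = np.array([0] * 16 + [1] * 8)

# Copy the true labels and flip one of them to simulate a single mistake
predicted = true_labels.copy()
predicted[0] = 1

# Accuracy computed by hand: share of positions where prediction equals truth
manual_accuracy = (predicted == true_labels).mean() * 100

# The same figure computed by scikit-learn
library_accuracy = accuracy_score(true_labels, predicted) * 100

print("Manual: %.2f%%" % manual_accuracy)
print("accuracy_score: %.2f%%" % library_accuracy)
```

Both lines should print the same value, since `accuracy_score` with its default settings returns the fraction of matching labels.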


<h3>However, scikit-learn already provides a function for separating the training and testing sets, so let's import train_test_split from the sklearn.model_selection module</h3>

In [51]:
from sklearn.svm import LinearSVC
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split

# train_test_split picks the rows at random, so we fix a seed to get the same split every time we run the code
SEED = 30 # Any number

# test_size=0.25 reserves 25% of the dataset for testing, leaving 75% for training
train_x, test_x, train_y, test_y = train_test_split(x, y, random_state=SEED, test_size=0.25)
print("We'll train with %d elements and test with %d elements" % (len(train_x), len(test_x)))

# Creating the model
model = LinearSVC()
model.fit(train_x, train_y)

# Making predictions
predictions = model.predict(test_x)

# Evaluating the model's accuracy
accuracy = accuracy_score(test_y, predictions) * 100

print("The accuracy was %.2f%%" % accuracy)

We'll train with 74 elements and test with 25 elements
The accuracy was 92.00%
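
To see what the fixed seed buys us, here is a minimal sketch on hypothetical stand-in data (the arrays `features` and `labels` are invented for illustration). It calls train_test_split twice with the same `random_state` and checks that both calls select the same test rows.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Hypothetical stand-in data: 20 rows, one feature, alternating binary labels
features = np.arange(20).reshape(-1, 1)
labels = np.array([0, 1] * 10)

SEED = 30  # Any number, as long as it is the same in both calls

# Each call returns [train_features, test_features, train_labels, test_labels]
split_a = train_test_split(features, labels, random_state=SEED, test_size=0.25)
split_b = train_test_split(features, labels, random_state=SEED, test_size=0.25)

# Compare the test features chosen by the two calls
same_test_rows = np.array_equal(split_a[1], split_b[1])

print("Same test rows with the same seed:", same_test_rows)
print("Train size: %d, test size: %d" % (len(split_a[0]), len(split_a[1])))
```

Leaving `random_state` unset would let the selection vary between runs, which is why the accuracy could change from one execution to the next.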


<h3>But there's still a problem</h3>

If, by chance, our training set contains proportionally many more buyers than the full dataset, the model will learn that users buy more often than they really do. We should therefore tell train_test_split to stratify the split: rows are still selected at random, but each set keeps the same proportion of each label as the whole dataset.

Look at the difference in proportions:

In [50]:
train_y.value_counts()

0    49
1    25
Name: bought, dtype: int64

In [49]:
test_y.value_counts()

0    17
1     8
Name: bought, dtype: int64
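
Raw counts are hard to compare across sets of different sizes, so it can help to look at shares instead. The sketch below rebuilds two label Series from the counts printed above (49/25 and 17/8) and uses `value_counts(normalize=True)`; the names `train_labels` and `test_labels` are stand-ins for `train_y` and `test_y`.

```python
import pandas as pd

# Labels rebuilt from the printed counts: 49 non-buyers and 25 buyers for training,
# 17 non-buyers and 8 buyers for testing
train_labels = pd.Series([0] * 49 + [1] * 25, name="bought")
test_labels = pd.Series([0] * 17 + [1] * 8, name="bought")

# normalize=True returns the fraction of rows per label instead of the count
train_share = train_labels.value_counts(normalize=True)
test_share = test_labels.value_counts(normalize=True)

print("Buyers in the training set: %.1f%%" % (train_share.loc[1] * 100))
print("Buyers in the test set: %.1f%%" % (test_share.loc[1] * 100))
```

This gives roughly 33.8% buyers in the training set against 32.0% in the test set. The gap is small for this seed, but nothing in an unstratified split guarantees that.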

So let's do it all over again, but now stratifying the data:

In [53]:
from sklearn.svm import LinearSVC
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split

# train_test_split picks the rows at random, so we fix a seed to get the same split every time the code is run
SEED = 30 # Any number

# test_size=0.25 reserves 25% of the dataset for testing, leaving 75% for training
# stratify=y keeps the proportion of each label the same in both sets
train_x, test_x, train_y, test_y = train_test_split(x, y,
                                                    random_state=SEED,
                                                    test_size=0.25,
                                                    stratify=y)
print("We'll train with %d elements and test with %d elements" % (len(train_x), len(test_x)))

# Creating the model
model = LinearSVC()
model.fit(train_x, train_y)

# Making predictions
predictions = model.predict(test_x)

# Evaluating the model's accuracy
accuracy = accuracy_score(test_y, predictions) * 100

print("The accuracy was %.2f%%" % accuracy)

We'll train with 74 elements and test with 25 elements
The accuracy was 96.00%
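
To confirm what `stratify` does, here is a small sketch on hypothetical data (100 invented rows, 40% of them labelled 1, chosen so the split divides evenly). It checks that the share of positive labels is the same in the training and test sets after a stratified split.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Hypothetical data: 100 rows, 40 positives and 60 negatives (not the tracking data)
features = np.arange(100).reshape(-1, 1)
labels = np.array([1] * 40 + [0] * 60)

train_f, test_f, train_l, test_l = train_test_split(
    features, labels,
    random_state=30,
    test_size=0.25,
    stratify=labels,  # keep the 40/60 label ratio in both sets
)

# The mean of a 0/1 array is the share of positives
train_rate = train_l.mean()
test_rate = test_l.mean()

print("Positives in the training set: %.1f%%" % (train_rate * 100))
print("Positives in the test set: %.1f%%" % (test_rate * 100))
```

With these sizes, both sets end up with 40% positives. On the real dataset the row counts do not divide evenly, so the proportions can only match approximately.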
