In [1]:
import pandas as pd

In [2]:
df = pd.read_csv('internet_usage_spain.csv')

In [3]:
df.head()

Unnamed: 0,internet_usage,sex,age,education
0,0,Mujer,66,Primaria
1,1,Hombre,72,Primaria
2,1,Hombre,48,Medios universitarios
3,0,Hombre,59,Superiores
4,1,Mujer,44,Superiores


In [4]:
df['sex'] = df.sex.replace({'Mujer': 'Female', 'Hombre': 'Male'})

In [5]:
u = df.education.unique()

In [6]:
o = ['Elementary', 'University', 'PhD', 'Higher Level', 'No studies', 'High School']

In [7]:
df['education'] = df.education.replace(dict(zip(u,o)))

In [8]:
df.head()

Unnamed: 0,internet_usage,sex,age,education
0,0,Female,66,Elementary
1,1,Male,72,Elementary
2,1,Male,48,University
3,0,Male,59,PhD
4,1,Female,44,PhD


In [9]:
df.to_csv('internet_usage_spain_v2.csv')

In [10]:
df = pd.get_dummies(df, drop_first=True)

In [11]:
target = df.internet_usage

In [12]:
explanatory = df.drop(columns='internet_usage')

It's tough to find things that always work the same way in programming.

The steps of a Machine Learning (ML) model can be an exception.

Everything we want to compute a model and make conclusions about it, we would always make the following steps:

1. `model.fit()`
2. `model.predict()`
3. `model.score()`

And I am going to show you this with 3 different ML models.

- Decision Tree Classifier
- Random Forest Classifier
- Logistic Regression

But, let's first prepare the data:

```python
df = pd.read_csv('https://raw.githubusercontent.com/sotastica/data/main/internet_usage_spain_v2.csv')
df.head()
```

- Each row represents a person
- Each column represents a characteristic of the person
- The goal is to classify a person whether they used the internet during the last 12 months
- Depending on their demographics

We need to transform the categorical variables to dummy variables before computing the models:

```python
df = pd.get_dummies(df, drop_first=True)
df.head()
```

Now we separate the variables on their respective role within the model:

```python
target = df.internet_usage
explanatory = df.drop(columns='internet_usage')
```

## Decision Tree Classifier

```python

```

In [27]:
## Decision Tree Classifier

from sklearn.tree import DecisionTreeClassifier

model = DecisionTreeClassifier()
model.fit(X=explanatory, y=target)

pred_dt = model.predict(X=explanatory)
accuracy_dt = model.score(X=explanatory, y=target)

## Support Vector Machines

from sklearn.svm import SVC

model = SVC()
model.fit(X=explanatory, y=target)

pred_svm = model.predict(X=explanatory)
accuracy_svm = model.score(X=explanatory, y=target)

## Logistic Regression

from sklearn.neighbors import KNeighborsClassifier

model = KNeighborsClassifier()
model.fit(X=explanatory, y=target)

pred_lr = model.predict(X=explanatory)
accuracy_lr = model.score(X=explanatory, y=target)

The only thing that changes are the results of the prediccions. Obviously, the models are different. But they all follow the same steps that we described at the beginning:

1. `model.fit()` to compute the mathematical formula of the model
2. `model.predict()` to calculate predictions through the mathematical formula
3. `model.score()` to get the success ratio of the model



In [28]:
import dataframe_image as dfi

In [49]:
df_pred = pd.DataFrame({'internet_usage': df.internet_usage,
                        'pred_dt': pred_dt,
                        'pred_svm': pred_svm,
                        'pred_lr': pred_lr})

In [38]:
a = dfres.sample(10, random_state=7)

In [39]:
a

Unnamed: 0,internet_usage,pred_dt,pred_svm,pred_lr
214,0,0,1,0
2142,1,1,1,1
1680,1,0,0,0
1522,1,1,1,1
325,1,1,1,1
2283,1,1,1,1
1263,0,0,0,0
993,0,0,0,0
26,1,1,1,1
2190,0,0,0,0


In [23]:
dfi.export(a, 'dfres.png')

In [50]:
df_accuracy = pd.DataFrame({'accuracy': [accuracy_dt, accuracy_svm, accuracy_lr]},
                           index = ['Decision Tree Classifier', 'Support Vector Machines', 'Linear Regression'])

In [48]:
dfres

Unnamed: 0,accuracy
Decision Tree Classifier,0.859878
Support Vector Machines,0.783707
Linear Regression,0.834216


In [51]:
dfi.export(df_accuracy, 'df_accuracy.png')