In [1]:
!python --version

Python 3.7.13


**Julien VALENTIN**, **March 2022**, based on

> [Machine learning in Python with scikit-learn](https://www.fun-mooc.fr/fr/cours/machine-learning-python-scikit-learn/) by I.N.R.I.A on [F.U.N](https://www.fun-mooc.fr/fr/).

# Download, import and preparation

The dataset is described and available here: [https://archive.ics.uci.edu/ml/datasets/adult](https://archive.ics.uci.edu/ml/datasets/adult)

In [2]:
!wget -O sample_data/adult.data https://archive.ics.uci.edu/ml/machine-learning-databases/adult/adult.data

--2022-03-28 12:04:08--  https://archive.ics.uci.edu/ml/machine-learning-databases/adult/adult.data
Resolving archive.ics.uci.edu (archive.ics.uci.edu)... 128.195.10.252
Connecting to archive.ics.uci.edu (archive.ics.uci.edu)|128.195.10.252|:443... connected.
HTTP request sent, awaiting response... 200 OK
Length: 3974305 (3.8M) [application/x-httpd-php]
Saving to: ‘sample_data/adult.data’


2022-03-28 12:04:08 (16.4 MB/s) - ‘sample_data/adult.data’ saved [3974305/3974305]



In [3]:
!wget -O sample_data/adult.names https://archive.ics.uci.edu/ml/machine-learning-databases/adult/adult.names

--2022-03-28 12:04:08--  https://archive.ics.uci.edu/ml/machine-learning-databases/adult/adult.names
Resolving archive.ics.uci.edu (archive.ics.uci.edu)... 128.195.10.252
Connecting to archive.ics.uci.edu (archive.ics.uci.edu)|128.195.10.252|:443... connected.
HTTP request sent, awaiting response... 200 OK
Length: 5229 (5.1K) [application/x-httpd-php]
Saving to: ‘sample_data/adult.names’


2022-03-28 12:04:08 (96.5 MB/s) - ‘sample_data/adult.names’ saved [5229/5229]



We start by consulting the file `sample_data/adult.names` to get the column names.

In [4]:
column_names = ["age", "workclass", "fnlwgt", "education", "education-num", "marital-status", "occupation", 
                "relationship", "race", "sex", "capital-gain", "capital-loss", "hours-per-week", "native-country", 
                "income"]

### Loading

We can now import the dataset.

In [5]:
import pandas

dataset = pandas.read_csv("sample_data/adult.data", index_col=False, names=column_names)

### Exploration

In [6]:
dataset.head()

| | age | workclass | fnlwgt | education | education-num | marital-status | occupation | relationship | race | sex | capital-gain | capital-loss | hours-per-week | native-country | income |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 39 | State-gov | 77516 | Bachelors | 13 | Never-married | Adm-clerical | Not-in-family | White | Male | 2174 | 0 | 40 | United-States | <=50K |
| 1 | 50 | Self-emp-not-inc | 83311 | Bachelors | 13 | Married-civ-spouse | Exec-managerial | Husband | White | Male | 0 | 0 | 13 | United-States | <=50K |
| 2 | 38 | Private | 215646 | HS-grad | 9 | Divorced | Handlers-cleaners | Not-in-family | White | Male | 0 | 0 | 40 | United-States | <=50K |
| 3 | 53 | Private | 234721 | 11th | 7 | Married-civ-spouse | Handlers-cleaners | Husband | Black | Male | 0 | 0 | 40 | United-States | <=50K |
| 4 | 28 | Private | 338409 | Bachelors | 13 | Married-civ-spouse | Prof-specialty | Wife | Black | Female | 0 | 0 | 40 | Cuba | <=50K |


In [7]:
dataset.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 32561 entries, 0 to 32560
Data columns (total 15 columns):
 #   Column          Non-Null Count  Dtype 
---  ------          --------------  ----- 
 0   age             32561 non-null  int64 
 1   workclass       32561 non-null  object
 2   fnlwgt          32561 non-null  int64 
 3   education       32561 non-null  object
 4   education-num   32561 non-null  int64 
 5   marital-status  32561 non-null  object
 6   occupation      32561 non-null  object
 7   relationship    32561 non-null  object
 8   race            32561 non-null  object
 9   sex             32561 non-null  object
 10  capital-gain    32561 non-null  int64 
 11  capital-loss    32561 non-null  int64 
 12  hours-per-week  32561 non-null  int64 
 13  native-country  32561 non-null  object
 14  income          32561 non-null  object
dtypes: int64(6), object(9)
memory usage: 3.7+ MB


### Keeping the numerical data and the `income` attribute

In [8]:
object_names = ["workclass", "education", "marital-status", "occupation", "relationship", "race", "sex", "native-country"]
numerical_dataset = dataset.drop(columns = object_names)
numerical_dataset.head()

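| | age | fnlwgt | education-num | capital-gain | capital-loss | hours-per-week | income |
|---|---|---|---|---|---|---|---|
| 0 | 39 | 77516 | 13 | 2174 | 0 | 40 | <=50K |
| 1 | 50 | 83311 | 13 | 0 | 0 | 13 | <=50K |
| 2 | 38 | 215646 | 9 | 0 | 0 | 40 | <=50K |
| 3 | 53 | 234721 | 7 | 0 | 0 | 40 | <=50K |
| 4 | 28 | 338409 | 13 | 0 | 0 | 40 | <=50K |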
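As an alternative to listing the text columns by hand, the numerical columns can be selected by dtype. The sketch below uses a small hypothetical frame (`toy`, with made-up column subset) rather than the real dataset, and assumes the target column is a string column that has to be added back explicitly.

```python
import pandas

# Hypothetical miniature of the dataset: two numerical columns, two text columns.
toy = pandas.DataFrame({
    "age": [39, 50, 38],
    "workclass": ["State-gov", "Self-emp-not-inc", "Private"],
    "hours-per-week": [40, 13, 40],
    "income": ["<=50K", "<=50K", "<=50K"],
})

# Keep every numerical column, whatever its name.
numeric_part = toy.select_dtypes(include="number")

# The target is text, so it is dropped by the selection and joined back here.
toy_numerical = numeric_part.join(toy["income"])
print(list(toy_numerical.columns))
```

The explicit `drop(columns=...)` used in the notebook is more verbose but makes the removed columns visible; the dtype-based selection is mostly useful when the column list is long or may change.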


### Separating the target class from the predictors

In [9]:
target = numerical_dataset["income"]
target.value_counts()

 <=50K    24720
 >50K      7841
Name: income, dtype: int64
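The class labels in the output above are printed with a leading space, which suggests that the raw file separates fields with a comma followed by a space. A minimal sketch of how `pandas.read_csv` treats such input, using two hypothetical lines in that style instead of the real file:

```python
import io
import pandas

# Two hypothetical lines in the "comma + space" style, with a reduced column set.
raw_lines = (
    "39, State-gov, 77516, <=50K\n"
    "50, Self-emp-not-inc, 83311, <=50K\n"
)
demo_names = ["age", "workclass", "fnlwgt", "income"]

# Default behaviour: the space after each comma stays inside string values.
as_is = pandas.read_csv(io.StringIO(raw_lines), names=demo_names)

# skipinitialspace=True drops the whitespace that follows each delimiter.
stripped = pandas.read_csv(
    io.StringIO(raw_lines), names=demo_names, skipinitialspace=True
)

print(repr(as_is["income"][0]))
print(repr(stripped["income"][0]))
```

This does not affect the model trained here, since only numerical columns are used as predictors, but it matters as soon as labels are compared to literal strings such as `"<=50K"`.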

In [10]:
data = numerical_dataset.drop(columns = "income")
data.describe()

| | age | fnlwgt | education-num | capital-gain | capital-loss | hours-per-week |
|---|---|---|---|---|---|---|
| count | 32561.0 | 32561.0 | 32561.0 | 32561.0 | 32561.0 | 32561.0 |
| mean | 38.581647 | 189778.4 | 10.080679 | 1077.648844 | 87.30383 | 40.437456 |
| std | 13.640433 | 105550.0 | 2.57272 | 7385.292085 | 402.960219 | 12.347429 |
| min | 17.0 | 12285.0 | 1.0 | 0.0 | 0.0 | 1.0 |
| 25% | 28.0 | 117827.0 | 9.0 | 0.0 | 0.0 | 40.0 |
| 50% | 37.0 | 178356.0 | 10.0 | 0.0 | 0.0 | 40.0 |
| 75% | 48.0 | 237051.0 | 12.0 | 0.0 | 0.0 | 45.0 |
| max | 90.0 | 1484705.0 | 16.0 | 99999.0 | 4356.0 | 99.0 |
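The summary above shows columns on very different scales: `fnlwgt` has a standard deviation around 10^5, while `education-num` stays between 1 and 16. For a distance-based model this matters, because the largest-scale column tends to dominate the distance. The sketch below illustrates this on the first two rows of the dataset, restricted to three columns, using standard deviations rounded from the table; the helper `fnlwgt_share` is hypothetical.

```python
import numpy as np

# Columns: age, fnlwgt, education-num (first two rows of the dataset).
person_a = np.array([39.0, 77516.0, 13.0])
person_b = np.array([50.0, 83311.0, 13.0])

# Standard deviations rounded from the describe() output.
column_std = np.array([13.64, 105550.0, 2.57])

def fnlwgt_share(u, v):
    """Fraction of the squared Euclidean distance contributed by fnlwgt (index 1)."""
    squared = (u - v) ** 2
    return squared[1] / squared.sum()

# On raw values, fnlwgt accounts for almost all of the distance.
raw_share = fnlwgt_share(person_a, person_b)

# Dividing each column by its standard deviation puts columns on a comparable scale
# (subtracting the mean would cancel out in the difference, so it is omitted).
scaled_share = fnlwgt_share(person_a / column_std, person_b / column_std)

print(f"raw: {raw_share:.4f}, scaled: {scaled_share:.4f}")
```

The notebook fits the model on unscaled data; a common way to apply this kind of rescaling in scikit-learn is `sklearn.preprocessing.StandardScaler`, though whether it improves this particular model is not evaluated here.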


### Modeling with *K-nearest neighbors*

In [11]:
from sklearn.neighbors import KNeighborsClassifier

model = KNeighborsClassifier(n_neighbors=50)
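To make the idea behind the classifier concrete, here is a simplified NumPy sketch of k-nearest-neighbors prediction: for each query point, find the `n_neighbors` closest training points (Euclidean distance) and return the most frequent label among them. The function `knn_predict` and the toy data are hypothetical; scikit-learn's implementation is more elaborate (indexing structures, tie handling), so this is an illustration of the principle only.

```python
import numpy as np

def knn_predict(X_train, y_train, X_query, n_neighbors=3):
    """Predict a label for each query row by majority vote among its nearest training rows."""
    X_train = np.asarray(X_train, dtype=float)
    X_query = np.asarray(X_query, dtype=float)
    y_train = np.asarray(y_train)
    predictions = []
    for x in X_query:
        # Euclidean distance from the query point to every training point.
        distances = np.linalg.norm(X_train - x, axis=1)
        # Indices of the n_neighbors closest training points.
        nearest = np.argsort(distances)[:n_neighbors]
        # Majority vote among the neighbors' labels.
        labels, counts = np.unique(y_train[nearest], return_counts=True)
        predictions.append(labels[np.argmax(counts)])
    return np.array(predictions)

# Toy data: two well-separated groups of three points each.
X_toy = [[0, 0], [0, 1], [1, 0], [5, 5], [5, 6], [6, 5]]
y_toy = ["<=50K", "<=50K", "<=50K", ">50K", ">50K", ">50K"]

toy_predictions = knn_predict(X_toy, y_toy, [[0.2, 0.1], [5.5, 5.5]], n_neighbors=3)
print(toy_predictions)
```

With `n_neighbors=50`, as in the notebook, each prediction is a vote among 50 training samples, which smooths the decision compared with a small neighborhood.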

### Training on *all* the data

In [12]:
model.fit(data, target)

KNeighborsClassifier(n_neighbors=50)

### Predicting `income` for the first ten samples

In [13]:
computed_target = model.predict(data[:10])
(computed_target == target[:10]).value_counts()

True     9
False    1
Name: income, dtype: int64

### Evaluating performance

In [14]:
model.score(data, target)

0.7978563311937594

This is the training score: the accuracy is measured on the same data that was used to fit the model. In the next sections, we will see several tools for splitting a dataset into training data and test data.
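A short experiment shows why a score computed on the training data says little about generalization. In the sketch below, the labels are random and carry no information about the features, yet a 1-nearest-neighbor classifier reaches perfect accuracy on its own training set, because each point is its own nearest neighbor. The data and the names `X_noise`, `y_noise` and `memorizer` are hypothetical.

```python
import numpy as np
from sklearn.neighbors import KNeighborsClassifier

rng = np.random.default_rng(0)

# Random features and random labels: there is nothing to learn here.
X_noise = rng.normal(size=(200, 3))
y_noise = rng.integers(0, 2, size=200)

# With a single neighbor, every training point is predicted from itself.
memorizer = KNeighborsClassifier(n_neighbors=1)
memorizer.fit(X_noise, y_noise)

train_accuracy = memorizer.score(X_noise, y_noise)
print(train_accuracy)
```

With `n_neighbors=50` the effect is much weaker than in this extreme case, but the training score still cannot be read as an estimate of performance on new data.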

# Train-test split

In [15]:
from sklearn.model_selection import train_test_split

X_train, X_test, y_train, y_test = train_test_split(data, target)
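Called without extra arguments, `train_test_split` shuffles the data with a different random state on each run and holds out 25% of the samples, so the scores that follow vary slightly from one execution to the next. A minimal sketch of a reproducible, class-balanced variant, on hypothetical arrays `X_demo` and `y_demo` whose class ratio roughly mimics the dataset:

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Hypothetical data: 100 samples, about 76% in class 0 and 24% in class 1.
X_demo = np.arange(200).reshape(100, 2)
y_demo = np.array([0] * 76 + [1] * 24)

X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo,
    y_demo,
    test_size=0.25,     # same proportion as the default
    random_state=42,    # fixed seed: the same split on every run
    stratify=y_demo,    # keep the class proportions in both subsets
)

print(X_tr.shape, X_te.shape)
print(y_tr.mean(), y_te.mean())
```

The seed value is arbitrary; what matters is that it is fixed when results need to be compared across runs.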

In [16]:
model.fit(X_train, y_train)

KNeighborsClassifier(n_neighbors=50)

In [17]:
model.score(X_train, y_train)

0.7951269451269452

In [18]:
model.score(X_test, y_test)

0.7927773000859846

### Cross-validation

In [19]:
from sklearn.model_selection import cross_validate

model = KNeighborsClassifier(n_neighbors=50)
cv_result = cross_validate(model, data, target, cv=5)
cv_result

{'fit_time': array([0.14648485, 0.18632555, 0.11668634, 0.13039684, 0.09442735]),
 'score_time': array([0.60815239, 0.54676414, 0.6917944 , 0.53785968, 0.5516839 ]),
 'test_score': array([0.79610011, 0.79391892, 0.79591523, 0.79437961, 0.79253686])}
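For a classifier, passing an integer `cv=5` to `cross_validate` uses a stratified 5-fold split. The sketch below writes that loop out by hand on hypothetical synthetic data (`X_cv`, `y_cv`), fitting a fresh copy of the model on four folds and scoring it on the remaining one, then compares the result with `cross_validate`. The value `n_neighbors=15` is chosen only to suit the small synthetic set.

```python
import numpy as np
from sklearn.base import clone
from sklearn.model_selection import StratifiedKFold, cross_validate
from sklearn.neighbors import KNeighborsClassifier

rng = np.random.default_rng(0)

# Hypothetical data: the label depends on the first feature, plus some noise.
X_cv = rng.normal(size=(300, 4))
y_cv = (X_cv[:, 0] + 0.5 * rng.normal(size=300) > 0).astype(int)

base_model = KNeighborsClassifier(n_neighbors=15)

fold_scores = []
for train_idx, test_idx in StratifiedKFold(n_splits=5).split(X_cv, y_cv):
    # A fresh, unfitted copy for each fold, so folds do not share state.
    fold_model = clone(base_model)
    fold_model.fit(X_cv[train_idx], y_cv[train_idx])
    fold_scores.append(fold_model.score(X_cv[test_idx], y_cv[test_idx]))
fold_scores = np.array(fold_scores)

# The same procedure through the helper used in the notebook.
helper_scores = cross_validate(base_model, X_cv, y_cv, cv=5)["test_score"]

print(fold_scores)
print(helper_scores)
```

Each sample is used for testing exactly once, which is why the five test scores together give a more stable estimate than a single train-test split.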

In [20]:
scores = cv_result["test_score"]
print(
    "The mean cross-validation accuracy is: "
    f"{scores.mean():.3f} +/- {scores.std():.3f}"
)

The mean cross-validation accuracy is: 0.795 +/- 0.001
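To put this accuracy in context, it helps to compare it with a trivial baseline that always predicts the most frequent class. The sketch below does the arithmetic from the class counts reported earlier in the notebook; the variable names are hypothetical.

```python
# Class counts taken from the value_counts() output of the target.
class_counts = {"<=50K": 24720, ">50K": 7841}
n_samples = sum(class_counts.values())

# Accuracy of a model that always answers with the majority class.
baseline_accuracy = max(class_counts.values()) / n_samples

# Mean cross-validation accuracy reported above.
cv_accuracy = 0.795

print(f"baseline: {baseline_accuracy:.3f}")
print(f"gain over baseline: {cv_accuracy - baseline_accuracy:.3f}")
```

The k-nearest-neighbors model on numerical features therefore improves on the majority-class baseline by only a few percentage points.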
