## 1. Data acquisition
         
#### Automatic data acquisition
We are going to acquire our dataset from the **[UCI Machine Learning](https://archive.ics.uci.edu/ml/datasets/adult)** website. The libraries below are used to download it; those needed for preprocessing and analysis are imported as we go.

In [1]:
import requests  # HTTP requests
import os  # interaction with the operating system

In [2]:
# Download each file from the UCI website into path_to_data
def aquire_data(path_to_data, data_urls):
    # Create the target folder if it does not exist yet
    os.makedirs(path_to_data, exist_ok=True)

    for url in data_urls:
        response = requests.get(url)
        response.raise_for_status()  # stop on HTTP errors instead of saving an error page
        filename = os.path.join(path_to_data, os.path.basename(url))
        with open(filename, 'wb') as file:
            file.write(response.content)

In [3]:
data_urls = ["https://archive.ics.uci.edu/ml/machine-learning-databases/adult/adult.data",
             "https://archive.ics.uci.edu/ml/machine-learning-databases/adult/adult.names",
             "https://archive.ics.uci.edu/ml/machine-learning-databases/adult/adult.test"]

aquire_data('data', data_urls)

This action creates a new folder called **data** in our working directory. It contains the following files:  
 
* **adult.names**: describes the dataset and its attributes (the column names)  
* **adult.data**: contains all the observations in the training data  
* **adult.test**: contains all the observations in the test data  
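
Re-running the notebook downloads the three files again every time. The sketch below is one possible variant that skips files already present on disk. It is an illustration rather than the notebook's own code: the name `acquire_data_once` and the `timeout` parameter are hypothetical, and it uses the standard-library `urllib.request` in place of `requests` so that it has no third-party dependency.

```python
import os
import urllib.request


def acquire_data_once(path_to_data, data_urls, timeout=30):
    """Download each URL into path_to_data, skipping files that already exist.

    Returns the local file paths, in the same order as data_urls.
    (Hypothetical helper, not part of the original notebook.)
    """
    os.makedirs(path_to_data, exist_ok=True)  # no error if the folder exists
    local_paths = []
    for url in data_urls:
        filename = os.path.join(path_to_data, os.path.basename(url))
        if not os.path.exists(filename):
            # urlopen raises URLError/HTTPError on failure, so a failed
            # request never ends up saved as a data file.
            with urllib.request.urlopen(url, timeout=timeout) as response:
                data = response.read()
            with open(filename, "wb") as file:
                file.write(data)
        local_paths.append(filename)
    return local_paths
```

Calling `acquire_data_once('data', data_urls)` a second time would leave the existing files untouched, which also means a stale or partial file has to be deleted by hand before it is fetched again.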


#### Convert data into a Pandas Data Frame    
Now, we are going to load our data into a **[Pandas DataFrame](https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.DataFrame.html)**.  
To do so, we first bring in the pandas library with the **import** statement.  

In [10]:
import pandas as pd 

Here we load the training and test datasets. 
The column names are specified in the **column_names** variable. The separator is the regular expression **' \*, \*'**, which strips the whitespace around each comma. Missing values are marked with **?** in the raw files, so **na_values='?'** tells pandas to treat them as missing while loading. For **adult.test** we also pass **skiprows=1**, because its first line is not a data record. Finally, we specify **engine='python'** because the default C parser does not support regular-expression separators; without it, pandas falls back to the Python parser and emits a warning.  

In [11]:
column_names = ["Age", "Workclass", "fnlwgt", "Education", "Education-Num", 
                "Martial Status", "Occupation", "Relationship", "Race", "Sex", 
                "Capital-Gain", "Capital-Loss", "Hours-per-week", "Country", "Income"]

In [12]:
train = pd.read_csv('data/adult.data', names=column_names, sep=' *, *', na_values='?', 
                   engine='python') 

test = pd.read_csv('data/adult.test', names=column_names, sep=' *, *', skiprows=1, 
                   engine='python', na_values='?')
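
To see what the separator and **na_values** do, here is a small sketch on an in-memory sample. The two sample rows and the shortened column list are made up for illustration. It also shows a possible alternative, assuming the only stray whitespace follows the commas: `skipinitialspace=True` with a plain comma separator, which lets pandas keep its default C parser.

```python
import io

import pandas as pd

# Two made-up rows in the same ", "-separated layout as adult.data
sample = (
    "39, State-gov, 77516, <=50K\n"
    "18, ?, 103497, <=50K\n"
)
cols = ["Age", "Workclass", "fnlwgt", "Income"]

# Regex separator, as in the cell above (needs the Python parser)
regex_df = pd.read_csv(io.StringIO(sample), names=cols, sep=" *, *",
                       na_values="?", engine="python")

# Alternative: plain comma plus skipinitialspace (default C parser)
plain_df = pd.read_csv(io.StringIO(sample), names=cols,
                       skipinitialspace=True, na_values="?")

# In both frames the "?" is loaded as a missing value
print(regex_df["Workclass"].isna().tolist())
print(plain_df["Workclass"].isna().tolist())
```

Both calls should give the same values here; the regex version is more general because it also removes whitespace that appears before a comma.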

Now we have our training and testing data. We can inspect them with the pandas **head()** function, which returns the first 5 rows by default.

In [13]:
# Training Data  
train.head()

| | Age | Workclass | fnlwgt | Education | Education-Num | Martial Status | Occupation | Relationship | Race | Sex | Capital-Gain | Capital-Loss | Hours-per-week | Country | Income |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 39 | State-gov | 77516 | Bachelors | 13 | Never-married | Adm-clerical | Not-in-family | White | Male | 2174 | 0 | 40 | United-States | <=50K |
| 1 | 50 | Self-emp-not-inc | 83311 | Bachelors | 13 | Married-civ-spouse | Exec-managerial | Husband | White | Male | 0 | 0 | 13 | United-States | <=50K |
| 2 | 38 | Private | 215646 | HS-grad | 9 | Divorced | Handlers-cleaners | Not-in-family | White | Male | 0 | 0 | 40 | United-States | <=50K |
| 3 | 53 | Private | 234721 | 11th | 7 | Married-civ-spouse | Handlers-cleaners | Husband | Black | Male | 0 | 0 | 40 | United-States | <=50K |
| 4 | 28 | Private | 338409 | Bachelors | 13 | Married-civ-spouse | Prof-specialty | Wife | Black | Female | 0 | 0 | 40 | Cuba | <=50K |


In [14]:
# Testing Data  
test.head()

| | Age | Workclass | fnlwgt | Education | Education-Num | Martial Status | Occupation | Relationship | Race | Sex | Capital-Gain | Capital-Loss | Hours-per-week | Country | Income |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 25 | Private | 226802 | 11th | 7 | Never-married | Machine-op-inspct | Own-child | Black | Male | 0 | 0 | 40 | United-States | <=50K. |
| 1 | 38 | Private | 89814 | HS-grad | 9 | Married-civ-spouse | Farming-fishing | Husband | White | Male | 0 | 0 | 50 | United-States | <=50K. |
| 2 | 28 | Local-gov | 336951 | Assoc-acdm | 12 | Married-civ-spouse | Protective-serv | Husband | White | Male | 0 | 0 | 40 | United-States | >50K. |
| 3 | 44 | Private | 160323 | Some-college | 10 | Married-civ-spouse | Machine-op-inspct | Husband | Black | Male | 7688 | 0 | 40 | United-States | >50K. |
| 4 | 18 | NaN | 103497 | Some-college | 10 | Never-married | NaN | Own-child | White | Female | 0 | 0 | 30 | United-States | <=50K. |


In the testing data, the values in the **Income** column are not in the same format as in the training data: each one ends with a period. We can confirm this by applying the **unique()** function to the column in both datasets.  

In [15]:
# Training  
train.Income.unique()

array(['<=50K', '>50K'], dtype=object)

In [16]:
# Testing 
test.Income.unique()

array(['<=50K.', '>50K.'], dtype=object)

We need to transform the **Income** column of the test data to remove the trailing **"."**. The **where** function from the **[NumPy library](https://numpy.org/)** can do this.  

In [18]:
import numpy as np

In [19]:
test.Income = np.where(test.Income == '<=50K.', '<=50K', '>50K')

In [20]:
# Check that the change has been applied.
test.Income.unique()

array(['<=50K', '>50K'], dtype=object)
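
One property of the **where** approach is worth keeping in mind: every value that is not exactly `'<=50K.'` becomes `'>50K'`, including a missing or unexpected label. The sketch below, on a made-up Series, contrasts it with a possible alternative, the pandas string method `str.rstrip('.')`, which only removes the trailing period and leaves missing values missing.

```python
import numpy as np
import pandas as pd

# Made-up labels, including one missing value
income = pd.Series(["<=50K.", ">50K.", None, "<=50K."])

# np.where: anything that is not '<=50K.' is relabelled as '>50K'
relabelled = np.where(income == "<=50K.", "<=50K", ">50K")

# str.rstrip: strips the trailing period, keeps the missing value as missing
stripped = income.str.rstrip(".")

print(relabelled.tolist())
print(stripped.isna().tolist())
```

In this notebook **unique()** showed only the two expected labels, so both approaches give the same result on the real test set.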

Now both our training and testing data are in the same format. For further analysis and processing, it is more convenient to combine the two datasets into one dataframe. Here again, pandas makes our life easier with the **concat()** function. The concatenation is done along the rows (axis=0), stacking the test rows below the training rows, and not along the columns (axis=1), which would place the two dataframes side by side. 

In [21]:
# Concatenate train and test. We will split them again before the training phase
df = pd.concat((train, test), axis=0) 
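
As a quick illustration of the **axis** argument, the sketch below concatenates two tiny made-up dataframes both ways. It also shows `ignore_index=True`, an optional **concat()** argument not used in the cell above, which renumbers the rows instead of keeping each dataframe's original index.

```python
import pandas as pd

a = pd.DataFrame({"Age": [39, 50], "Income": ["<=50K", "<=50K"]})
b = pd.DataFrame({"Age": [25], "Income": [">50K"]})

# axis=0: rows of b are stacked below the rows of a
stacked = pd.concat((a, b), axis=0)
print(stacked.shape)           # (3, 2)
print(stacked.index.tolist())  # [0, 1, 0] -> original indexes are kept

# Same stacking, but with a fresh 0..n-1 index
renumbered = pd.concat((a, b), axis=0, ignore_index=True)
print(renumbered.index.tolist())  # [0, 1, 2]

# axis=1: columns of b are placed next to the columns of a
side_by_side = pd.concat((a, b), axis=1)
print(side_by_side.shape)      # (2, 4)
```

The repeated index labels after a plain `axis=0` concatenation do not matter here, since the combined dataframe is later saved with `index=False`.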

Let's look at the shapes (number of rows and columns) of the training, testing and final dataframes.  

In [22]:
train_shape = train.shape
test_shape = test.shape
final_shape = df.shape

print("Training data shape: {} rows and {} columns".format(train_shape[0],
                                                          train_shape[1]))

print("Testing data shape: {} rows and {} columns".format(test_shape[0],
                                                          test_shape[1]))

print("Final data shape: {} rows and {} columns".format(final_shape[0],
                                                          final_shape[1]))

Training data shape: 32561 rows and 15 columns
Testing data shape: 16281 rows and 15 columns
Final data shape: 48842 rows and 15 columns


**Conclusion:** we now have the final dataframe, so we save it to disk for use in the next step.  

In [23]:
print("Saving the final dataframe ...")
df.to_csv('data/combined_data.csv', index=False)

Saving the final dataframe ...
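
A saved file can be checked by reading it back. The sketch below does this round trip on a small made-up dataframe in a temporary folder (the folder and the sample rows are assumptions for illustration). It shows that with `index=False` no extra index column appears on reload, and that missing values survive the round trip.

```python
import os
import tempfile

import pandas as pd

# Made-up sample with one missing value
df_demo = pd.DataFrame({
    "Age": [39, 18],
    "Workclass": ["State-gov", None],
    "Income": ["<=50K", "<=50K"],
})

with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, "combined_data.csv")
    df_demo.to_csv(path, index=False)  # same call as in the cell above
    reloaded = pd.read_csv(path)

print(list(reloaded.columns))
print(reloaded["Workclass"].isna().tolist())
```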


**Go to the next step with 2_Data-Munging**