In [1]:
import pandas as pd

In [2]:
# The data URI
csv_file_uri = "https://archive.ics.uci.edu/ml/machine-learning-databases/adult/adult.data"

### Data overview

```
>50K, <=50K.

age: continuous.
workclass: Private, Self-emp-not-inc, Self-emp-inc, Federal-gov, Local-gov, State-gov, Without-pay, Never-worked.
fnlwgt: continuous.
education: Bachelors, Some-college, 11th, HS-grad, Prof-school, Assoc-acdm, Assoc-voc, 9th, 7th-8th, 12th, Masters, 1st-4th, 10th, Doctorate, 5th-6th, Preschool.
education-num: continuous.
marital-status: Married-civ-spouse, Divorced, Never-married, Separated, Widowed, Married-spouse-absent, Married-AF-spouse.
occupation: Tech-support, Craft-repair, Other-service, Sales, Exec-managerial, Prof-specialty, Handlers-cleaners, Machine-op-inspct, Adm-clerical, Farming-fishing, Transport-moving, Priv-house-serv, Protective-serv, Armed-Forces.
relationship: Wife, Own-child, Husband, Not-in-family, Other-relative, Unmarried.
race: White, Asian-Pac-Islander, Amer-Indian-Eskimo, Other, Black.
sex: Female, Male.
capital-gain: continuous.
capital-loss: continuous.
hours-per-week: continuous.
native-country: United-States, Cambodia, England, Puerto-Rico, Canada, Germany, Outlying-US(Guam-USVI-etc), India, Japan, Greece, South, China, Cuba, Iran, Honduras, Philippines, Italy, Poland, Jamaica, Vietnam, Mexico, Portugal, Ireland, France, Dominican-Republic, Laos, Ecuador, Taiwan, Haiti, Columbia, Hungary, Guatemala, Nicaragua, Scotland, Thailand, Yugoslavia, El-Salvador, Trinadad&Tobago, Peru, Hong, Holand-Netherlands.
```

### Loading the data

In [3]:
column_names = [
    "age", "workclass", "fnlwgt", "education", "education-num", "marital-status",
    "occupation", "relationship", "race", "sex", "capital-gain", "capital-loss",
    "hours-per-week", "native-country", "target"
]


data_original = pd.read_csv(csv_file_uri, names=column_names, index_col=False)
# Make a copy so that we always have the original data to refer to
data = data_original.copy(deep=True)
# Drop the census sampling weights (fnlwgt); we do not use them as a feature
data.drop(["fnlwgt"], axis=1, inplace=True)

# Show the head rows of the table at this stage.
data.head(3)

| | age | workclass | education | education-num | marital-status | occupation | relationship | race | sex | capital-gain | capital-loss | hours-per-week | native-country | target |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 39 | State-gov | Bachelors | 13 | Never-married | Adm-clerical | Not-in-family | White | Male | 2174 | 0 | 40 | United-States | <=50K |
| 1 | 50 | Self-emp-not-inc | Bachelors | 13 | Married-civ-spouse | Exec-managerial | Husband | White | Male | 0 | 0 | 13 | United-States | <=50K |
| 2 | 38 | Private | HS-grad | 9 | Divorced | Handlers-cleaners | Not-in-family | White | Male | 0 | 0 | 40 | United-States | <=50K |


In [4]:
# How big is the dataset?
data.shape

(32561, 14)

The variable we want to predict is the **target** column. The problem is that it holds text, and most models cannot work with text directly.

Here is one way to change a column in a pandas dataframe (using the `apply` method):

In [None]:
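# Create a function that changes the text to a simple binary value:
# 0 for " <=50K" and 1 for anything else (that is, " >50K").
# Note the leading space: fields in the raw file are separated by ", ".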
def convert_target_variable(text):
    if text == " <=50K":
        return 0
    else:
        return 1

data["target"] = data.target.apply(convert_target_variable)

data.head(3)
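
As an alternative to `apply`, the same conversion can be written as a vectorized comparison. The sketch below uses a small made-up frame (`toy`, a hypothetical name) instead of the real data so that it runs without downloading anything. It also differs slightly from the function above: it strips whitespace first and marks only `>50K` as 1, rather than treating every value other than ` <=50K` as 1.

```python
import pandas as pd

# Toy stand-in for the "target" column (values keep the leading space
# found in the raw file).
toy = pd.DataFrame({"target": [" <=50K", " >50K", " <=50K"]})

# Strip whitespace, compare against the positive label, then cast
# the resulting True/False values to 1/0.
toy["target_binary"] = (toy["target"].str.strip() == ">50K").astype(int)

print(toy["target_binary"].tolist())
```

On large frames a vectorized comparison is usually faster than calling a Python function once per row.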

In [None]:
# Check how many people make 50K or less (0) and how many make more (1)
data.target.value_counts()

### Data Preprocessing

At this point we have a big problem with our data. Most algorithms can only handle numerical data, because they rely on mathematical operations. For this reason we need to replace every text column with a numerical one.

For this you can use one of the many classes in the machine learning toolkit **scikit-learn**.

In [None]:
from sklearn import preprocessing

In [None]:
encoder = preprocessing.LabelEncoder()

data["race_encoded"] = encoder.fit_transform(data.race.values)
data.drop(["race"], axis=1, inplace=True)

# View your new column.
data.head(3)

**Discuss what a label encoder does and make sure you understand how it works.**
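
To support that discussion, here is a minimal sketch of `LabelEncoder` on a made-up list of labels. The names `toy_race`, `toy_encoder`, and `toy_codes` are illustrative and not part of the notebook's data.

```python
from sklearn import preprocessing

# A tiny, made-up list of category labels.
toy_race = ["White", "Black", "Other", "White"]

toy_encoder = preprocessing.LabelEncoder()
toy_codes = toy_encoder.fit_transform(toy_race)

# classes_ holds the sorted unique labels; each label's position
# in that array is the integer it is encoded as.
print(list(toy_encoder.classes_))   # ['Black', 'Other', 'White']
print(toy_codes.tolist())           # [2, 0, 1, 2]

# The mapping can be reversed to recover the original labels.
print(list(toy_encoder.inverse_transform(toy_codes)))
```

The integers follow the sorted order of the labels, so they carry no real meaning. A model may still treat `2` as "larger" than `0`, which is one reason to consider one-hot encoding instead.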

We have a whole bunch of these text columns (run the following block to see which), so let's encode them all.

In [None]:
data.dtypes
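
If you only want the names of the non-numeric columns rather than the full dtype listing, `select_dtypes` can filter them. This is a sketch on a small made-up frame (`toy` and `text_columns` are hypothetical names), not on the notebook's data.

```python
import pandas as pd

# Toy frame mixing numeric and text columns.
toy = pd.DataFrame({
    "age": [39, 50],
    "workclass": ["State-gov", "Private"],
    "hours-per-week": [40, 13],
    "sex": ["Male", "Female"],
})

# Keep only the columns that are not numeric.
text_columns = toy.select_dtypes(exclude="number").columns.tolist()

print(text_columns)
```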

In [None]:
encoded_columns = []
for c in data.columns:
    if data[c].dtype == "object":
        if "{}_encoded".format(c) not in data.columns:
            encoder = preprocessing.LabelEncoder()
            data["{}_encoded".format(c)] = encoder.fit_transform(data[c].values)
            encoded_columns.append(c)
            encoder = None
        else:
            print("{}_encoded already exists".format(c))

print("Dropping the encoded columns {}".format(encoded_columns))
data.drop(encoded_columns, axis=1, inplace=True)

In [None]:
# Check out the new numerical data table.
data.head()

In [None]:
# All available column names
data.columns

### Challenge

In [None]:
# Don't use the LabelEncoder but use one-hot-encoding instead.
# For this you will need to use the pandas function pd.get_dummies
# to encode and either a dataframe join or merge to merge the dataframes
# inside the loop.

In [None]:
# Make a copy so that we always have the original data to refer to
data_v2 = data_original.copy(deep=True)

data_v2_dummies = pd.get_dummies(data_v2)

print(data_v2_dummies.head())

In [None]:
# get_dummies created two target columns, but we only need one of them.
# Drop the "<=50K" column so that the remaining column holds 1 for
# people who make more than 50K and 0 for people who make 50K or less.
data_v2_dummies.drop(["target_ <=50K"], axis=1, inplace=True)
# Remove the US census weights
data_v2_dummies.drop(["fnlwgt"], axis=1, inplace=True)

# Rename the target
data_v2_dummies.rename(columns={'target_ >50K': 'target' }, inplace=True)

print(data_v2_dummies.head())
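
The challenge also mentions encoding inside a loop and joining the resulting dataframes. A minimal sketch of that variant follows, on a small made-up frame; `toy` and `encoded` are hypothetical names, and the values omit the leading spaces found in the real file.

```python
import pandas as pd

# Toy frame with one numeric and two text columns.
toy = pd.DataFrame({
    "age": [39, 50, 38],
    "sex": ["Male", "Female", "Male"],
    "race": ["White", "Black", "White"],
})

encoded = toy.copy()
for column in toy.columns:
    # Only non-numeric columns need one-hot encoding.
    if not pd.api.types.is_numeric_dtype(toy[column]):
        # One indicator column per distinct value, named "<column>_<value>".
        dummies = pd.get_dummies(toy[column], prefix=column)
        # Replace the text column with its indicator columns.
        encoded = encoded.drop(columns=[column]).join(dummies)

print(encoded.columns.tolist())
```

Calling `pd.get_dummies` on the whole dataframe, as in the cell above, gives the same kind of result in one step. The loop is mainly useful when you want to treat individual columns differently.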

---
### Additional resources

* [10 Minutes to pandas](https://pandas.pydata.org/pandas-docs/stable/10min.html)
* [Intro to Pandas: -1 : An absolute beginners guide to Machine Learning and Data science.](https://hackernoon.com/intro-to-pandas-1-an-absolute-beginners-guide-to-machine-learning-and-data-science-a1fed3a6f0f3)
* [Introduction to Pandas with Practical Examples](http://pythonforengineers.com/introduction-to-pandas/)
* [Pandas Tutorial: DataFrames in Python](https://www.datacamp.com/community/tutorials/pandas-tutorial-dataframe-python)
* [Python Pandas Tutorial](https://www.tutorialspoint.com/python_pandas/)