**Getting Started**
***Goals***
- Install the software requirements for the labs
- Learn how to load the dataset
- Understand the dataset structure
- Plot the data
- Manipulate the data

**Installing Software Requirements**

Here is a list of what you'll need:
- Python 3 ([download](https://www.python.org/downloads/))
- Pandas for data manipulation
- NumPy for computations
- TensorFlow and Keras for machine learning
- Matplotlib for plotting

Once all the software requirements are installed, run the following cell. You should see the message "Successfully installed all required components" printed.

In [2]:
%matplotlib inline
import numpy as np
import tensorflow as tf
import keras
import pandas as pd
import matplotlib.pyplot as plt
import os
print("Successfully installed all required components")

Successfully installed all required components


Using TensorFlow backend.
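If the import cell fails instead, it helps to know which package is missing. The following is a minimal sketch, assuming the packages are importable under the names used above; `REQUIRED` and `find_missing` are illustrative names, not part of the lab material.

```python
import importlib.util

# Import names of the packages used in this lab (illustrative list).
REQUIRED = ["numpy", "pandas", "matplotlib", "tensorflow", "keras"]

def find_missing(packages):
    """Return the packages that cannot be located in the current environment."""
    return [name for name in packages if importlib.util.find_spec(name) is None]

missing = find_missing(REQUIRED)
if missing:
    print("Missing packages:", ", ".join(missing))
else:
    print("Successfully installed all required components")
```

Unlike a plain `import`, `find_spec` only looks the package up, so the check reports every missing package in one pass rather than stopping at the first failure.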


**Loading the UCI Adult dataset**

In this tutorial, we will work with the UCI Adult dataset. We use a local copy of the dataset, downloaded from [here](https://archive.ics.uci.edu/ml/datasets/adult).

The dataset is stored in the local "data/adult" directory.


**Define dataset path in variables**

In [3]:
ADULT_DIRECTORY = os.path.join(os.getcwd(), 'data','adult')
ADULT_DATA_PATH = os.path.join(ADULT_DIRECTORY, 'adult.data')
ADULT_TEST_PATH = os.path.join(ADULT_DIRECTORY, 'adult.test')

In [4]:
COLUMN_NAMES = ["age", "workclass", "fnlwgt", "education", "education-num", "marital-status", "occupation", 
                "relationship", "race", "sex", "capital-gain", "capital-loss", "hours-per-week", "native-country", "salary"]

**Load the datasets**

In [5]:
adult_data = pd.read_csv(ADULT_DATA_PATH, names=COLUMN_NAMES)
adult_test = pd.read_csv(ADULT_TEST_PATH, names=COLUMN_NAMES)

In [6]:
adult_data.head()

| | age | workclass | fnlwgt | education | education-num | marital-status | occupation | relationship | race | sex | capital-gain | capital-loss | hours-per-week | native-country | salary |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 39 | State-gov | 77516 | Bachelors | 13 | Never-married | Adm-clerical | Not-in-family | White | Male | 2174 | 0 | 40 | United-States | <=50K |
| 1 | 50 | Self-emp-not-inc | 83311 | Bachelors | 13 | Married-civ-spouse | Exec-managerial | Husband | White | Male | 0 | 0 | 13 | United-States | <=50K |
| 2 | 38 | Private | 215646 | HS-grad | 9 | Divorced | Handlers-cleaners | Not-in-family | White | Male | 0 | 0 | 40 | United-States | <=50K |
| 3 | 53 | Private | 234721 | 11th | 7 | Married-civ-spouse | Handlers-cleaners | Husband | Black | Male | 0 | 0 | 40 | United-States | <=50K |
| 4 | 28 | Private | 338409 | Bachelors | 13 | Married-civ-spouse | Prof-specialty | Wife | Black | Female | 0 | 0 | 40 | Cuba | <=50K |


In [8]:
adult_data.shape

(32561, 15)

In [10]:
adult_test.head()

| | age | workclass | fnlwgt | education | education-num | marital-status | occupation | relationship | race | sex | capital-gain | capital-loss | hours-per-week | native-country | salary |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | \|1x3 Cross validator | | | | | | | | | | | | | | |
| 1 | 25 | Private | 226802.0 | 11th | 7.0 | Never-married | Machine-op-inspct | Own-child | Black | Male | 0.0 | 0.0 | 40.0 | United-States | <=50K. |
| 2 | 38 | Private | 89814.0 | HS-grad | 9.0 | Married-civ-spouse | Farming-fishing | Husband | White | Male | 0.0 | 0.0 | 50.0 | United-States | <=50K. |
| 3 | 28 | Local-gov | 336951.0 | Assoc-acdm | 12.0 | Married-civ-spouse | Protective-serv | Husband | White | Male | 0.0 | 0.0 | 40.0 | United-States | >50K. |
| 4 | 44 | Private | 160323.0 | Some-college | 10.0 | Married-civ-spouse | Machine-op-inspct | Husband | Black | Male | 7688.0 | 0.0 | 40.0 | United-States | >50K. |


In [11]:
adult_test.shape

(16282, 15)
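The first row of `adult_test.head()` is not a record: the file opens with the banner line `|1x3 Cross validator`. That extra line is counted in the shape, it causes `age` to be read as text and the other numeric columns as floats, and every label in the test file ends in a period. One possible fix is sketched here, assuming the raw file separates fields with a comma followed by a space. The sample rows are copied from the output above, and `adult_test_clean` is an illustrative name.

```python
import io
import pandas as pd

COLUMN_NAMES = ["age", "workclass", "fnlwgt", "education", "education-num",
                "marital-status", "occupation", "relationship", "race", "sex",
                "capital-gain", "capital-loss", "hours-per-week",
                "native-country", "salary"]

# Stand-in for adult.test: the banner line followed by two records.
raw_test = io.StringIO(
    "|1x3 Cross validator\n"
    "25, Private, 226802, 11th, 7, Never-married, Machine-op-inspct, "
    "Own-child, Black, Male, 0, 0, 40, United-States, <=50K.\n"
    "38, Private, 89814, HS-grad, 9, Married-civ-spouse, Farming-fishing, "
    "Husband, White, Male, 0, 0, 50, United-States, <=50K.\n"
)

adult_test_clean = pd.read_csv(
    raw_test,
    names=COLUMN_NAMES,
    skiprows=1,             # skip the banner line
    skipinitialspace=True,  # drop the space that follows each comma
)
# Remove the trailing period so the labels match those in adult.data
adult_test_clean["salary"] = adult_test_clean["salary"].str.rstrip(".")

print(adult_test_clean.shape)
print(adult_test_clean.dtypes)
```

With the real file, `raw_test` would be replaced by `ADULT_TEST_PATH`.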

In [13]:
adult_data.describe(include='all')

| | age | workclass | fnlwgt | education | education-num | marital-status | occupation | relationship | race | sex | capital-gain | capital-loss | hours-per-week | native-country | salary |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| count | 32561.0 | 32561 | 32561.0 | 32561 | 32561.0 | 32561 | 32561 | 32561 | 32561 | 32561 | 32561.0 | 32561.0 | 32561.0 | 32561 | 32561 |
| unique | | 9 | | 16 | | 7 | 15 | 6 | 5 | 2 | | | | 42 | 2 |
| top | | Private | | HS-grad | | Married-civ-spouse | Prof-specialty | Husband | White | Male | | | | United-States | <=50K |
| freq | | 22696 | | 10501 | | 14976 | 4140 | 13193 | 27816 | 21790 | | | | 29170 | 24720 |
| mean | 38.581647 | | 189778.4 | | 10.080679 | | | | | | 1077.648844 | 87.30383 | 40.437456 | | |
| std | 13.640433 | | 105550.0 | | 2.57272 | | | | | | 7385.292085 | 402.960219 | 12.347429 | | |
| min | 17.0 | | 12285.0 | | 1.0 | | | | | | 0.0 | 0.0 | 1.0 | | |
| 25% | 28.0 | | 117827.0 | | 9.0 | | | | | | 0.0 | 0.0 | 40.0 | | |
| 50% | 37.0 | | 178356.0 | | 10.0 | | | | | | 0.0 | 0.0 | 40.0 | | |
| 75% | 48.0 | | 237051.0 | | 12.0 | | | | | | 0.0 | 0.0 | 45.0 | | |
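`describe(include='all')` puts numeric and categorical summaries in one table, which is why so many cells are empty: `unique`, `top` and `freq` apply only to categorical columns, and the mean and quantiles only to numeric ones. Splitting the columns by dtype can make the structure easier to read. This is a small sketch on a hand-made frame with a few of the Adult columns; the rows are copied from the `adult_data.head()` output, and `sample` is an illustrative name.

```python
import pandas as pd

sample = pd.DataFrame({
    "age": [39, 50, 38, 53, 28],
    "workclass": ["State-gov", "Self-emp-not-inc", "Private", "Private", "Private"],
    "hours-per-week": [40, 13, 40, 40, 40],
    "salary": ["<=50K"] * 5,
})

# Separate numeric from non-numeric (categorical) columns
numeric_cols = sample.select_dtypes(include="number").columns.tolist()
categorical_cols = sample.select_dtypes(exclude="number").columns.tolist()
print("numeric:", numeric_cols)
print("categorical:", categorical_cols)

# Share of each category as a fraction of all rows
workclass_share = sample["workclass"].value_counts(normalize=True)
print(workclass_share)
```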


# Selecting and Filtering Data
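The examples from here on operate on `hubway_data`, a DataFrame of bike trips with columns such as `start_date`, `end_date`, `birth_date`, `duration`, and `subsc_type`. It must be loaded before these cells will run.

* Select columns you are interested in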

In [None]:
columns_you_want = ['start_date', 'end_date'] # specify columns you're interested in
chosen_columns = hubway_data[columns_you_want] # select the columns
chosen_columns.head() # show first couple of lines of this new variable 

Filter the data: let's select all users born after 1982.

In [None]:
millennials = hubway_data[hubway_data.birth_date > 1982] # boolean mask: keep rows whose birth_date is later than 1982
millennials.head() # display the first few rows
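The same two operations apply to the Adult data loaded earlier. This minimal sketch uses a few rows copied from the `adult_data.head()` output so that it runs on its own; with the full dataset, `sample` would simply be `adult_data`.

```python
import pandas as pd

sample = pd.DataFrame({
    "age": [39, 50, 38, 53, 28],
    "education": ["Bachelors", "Bachelors", "HS-grad", "11th", "Bachelors"],
    "hours-per-week": [40, 13, 40, 40, 40],
})

# Select columns by passing a list of column names
chosen = sample[["age", "education"]]

# Filter rows with a boolean mask: keep people younger than 40
under_40 = sample[sample["age"] < 40]
print(under_40)
```

Conditions can be combined with `&` and `|`, with each condition wrapped in parentheses.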

# Split Data by Groups
* [Split](https://pandas.pydata.org/pandas-docs/stable/groupby.html) bike trips by type of user (registered vs. casual)
* Are the bike trips between registered and casual users different in duration?

In [None]:
grouped_data = hubway_data.groupby('subsc_type') # split the trips by subscription type
grouped_data['duration'].mean() # select the duration column first, then take the mean within each group
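Split-then-aggregate works the same way on the Adult data. As an illustration, the sketch here compares mean weekly hours by sex on five rows copied from the `adult_data.head()` output; the column is selected before aggregating so that only a numeric column is averaged.

```python
import pandas as pd

sample = pd.DataFrame({
    "sex": ["Male", "Male", "Male", "Male", "Female"],
    "hours-per-week": [40, 13, 40, 40, 40],
})

# Split by sex, then average hours-per-week within each group
mean_hours = sample.groupby("sex")["hours-per-week"].mean()
print(mean_hours)
```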

# Applying a Function
* Apply a function to a column of a DataFrame
* Let's transform start date and birth date to get user's age

In [None]:
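def get_age(x):
    """
    Calculate the age of the user at the time of the trip.
    x : a row holding 'birth_date' (a year) and 'start_date' (a string)
    """
    # look the values up by column name rather than by position
    birthdate = x['birth_date']
    startdate = x['start_date']
    # start date comes in the form of "7/28/2011 10:12:00"
    # the year is the four characters just before the " hh:mm:ss" suffix
    check_out_year = int(startdate[-13:-9])
    age = check_out_year - birthdate
    return age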

In [None]:
hubway_data['age'] = hubway_data[['birth_date', 'start_date']].apply(get_age, axis=1)
hubway_data['age'].head()
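Row-wise `apply` is flexible but slow on large frames, and slicing the year out of a string depends on the exact date layout. A possible alternative is to parse the dates once and subtract whole columns. This sketch assumes the `"7/28/2011 10:12:00"` format mentioned in the comment above; the two rows in `trips` are made up for illustration.

```python
import pandas as pd

# Made-up rows that mimic the two columns used by get_age
trips = pd.DataFrame({
    "birth_date": [1976, 1990],
    "start_date": ["7/28/2011 10:12:00", "10/2/2012 08:30:00"],
})

# Parse the strings into timestamps, then take the year of every row at once
start = pd.to_datetime(trips["start_date"], format="%m/%d/%Y %H:%M:%S")
trips["age"] = start.dt.year - trips["birth_date"]
print(trips["age"].tolist())
```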

# DATA VISUALIZATION
* Identify hidden patterns and trends
* Formulate hypotheses
* Determine the best steps for modeling
* Communicate results

Let's explore trips that last a relatively short time (less than 2 hours, i.e. 7200 seconds). We will also remove trips where some information is missing, using the [dropna() function](https://pandas.pydata.org/pandas-docs/stable/generated/pandas.DataFrame.dropna.html), to make plotting easier.

In [None]:
short_distance_trips = hubway_data[hubway_data.duration < 7200].dropna()
short_distance_trips.head()

* Scatter plots - numerical data - useful for exploring correlations in data
* Age vs. duration of bike trip 

In [None]:
plt.scatter(short_distance_trips['age'], short_distance_trips['duration'])
plt.title('Scatter plot of Duration by User Ages')
plt.xlabel('Age in years')
plt.ylabel('Duration (in seconds)')

* Histograms - distribution of the variable
* Useful for identifying outliers, multi-modality

In [None]:
plt.hist(short_distance_trips['age'], bins = 10)
plt.axvline(short_distance_trips['age'].mean(), color='green', label='Average Age')
plt.legend()
plt.xlabel('Age')
plt.ylabel('Frequency')
plt.title('Histogram of User Age')
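The same kind of histogram can be drawn for the `age` column of the Adult data. This sketch uses the nine ages visible in the two `head()` outputs earlier, so the shape of the plot says nothing about the full dataset; with the real data, `ages` would be `adult_data['age']`.

```python
import matplotlib.pyplot as plt
import pandas as pd

# Ages copied from the adult_data.head() and adult_test.head() outputs
ages = pd.Series([39, 50, 38, 53, 28, 25, 38, 28, 44], name="age")

fig, ax = plt.subplots()
# hist returns the count per bin and the bin edges, which is handy for checks
counts, bin_edges, _ = ax.hist(ages, bins=5)
ax.axvline(ages.mean(), color="green", label="Average age")
ax.legend()
ax.set_xlabel("Age")
ax.set_ylabel("Frequency")
ax.set_title("Histogram of age (sample rows)")
```

The number of bins is a free choice: too few bins can hide multi-modality, while too many make the plot noisy.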

* Bar plot - useful for categorical data
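* Let's [obtain counts](https://pandas.pydata.org/pandas-docs/stable/generated/pandas.Series.value_counts.html) of the number of users by gender and plot them

In [None]:
# count users per gender; assumes the column is named 'gender'
gender_counts = hubway_data['gender'].value_counts()
plt.bar(range(len(gender_counts)), gender_counts, align='center', color=['gray', 'teal'])
plt.xticks(range(len(gender_counts)), gender_counts.index) # take the labels from the data so they always match the bars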
plt.title('Users by Gender')
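A bar plot of a categorical Adult column follows the same pattern. `value_counts` sorts by frequency, so reading the bar labels from its index keeps the labels and heights aligned. The five values here are the `sex` entries from the `adult_data.head()` output.

```python
import matplotlib.pyplot as plt
import pandas as pd

sex = pd.Series(["Male", "Male", "Male", "Male", "Female"], name="sex")
counts = sex.value_counts()  # most common category first

fig, ax = plt.subplots()
# Passing the category names as x positions labels the bars automatically
ax.bar(counts.index.tolist(), counts.values, color=["gray", "teal"])
ax.set_ylabel("Number of people")
ax.set_title("People by sex (sample rows)")
```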

# DATA CLEANING
* Wrong values
* Messy format
* Too many observations - do preliminary analysis on a subset of data
* Missing data
    * Drop samples with problematic values
    * Use mean, median or most common value of the feature
    * Use a model to estimate the value
    * Data might not be missing at random
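Before choosing a strategy, it helps to count how much is actually missing. In the UCI Adult files, missing entries are written as `?` rather than left empty, so pandas does not treat them as missing unless told to. This sketch uses three made-up rows in the Adult layout, reduced to three columns for brevity.

```python
import io
import pandas as pd

# Made-up rows in the Adult layout (three columns only)
raw = io.StringIO(
    "39, State-gov, Adm-clerical\n"
    "54, ?, ?\n"
    "28, Private, ?\n"
)
df = pd.read_csv(
    raw,
    names=["age", "workclass", "occupation"],
    skipinitialspace=True,  # drop the space that follows each comma
    na_values="?",          # treat '?' as a missing value
)

missing_per_column = df.isna().sum()
print(missing_per_column)
print(df.dropna().shape)  # rows left after dropping incomplete records
```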

# Dropping missing values

In [None]:
hubway_data_droppped = hubway_data.dropna()

# Dropping wrong values
* Explore outliers using histograms
* Trip duration should be a positive number
* Trip duration cannot be too long

In [None]:
plt.hist(hubway_data.duration.dropna())
plt.xlabel('Duration')
plt.ylabel('Frequency')

In [None]:
print('Minimum duration = ', np.min(hubway_data.duration))
print('Maximum duration = ', np.max(hubway_data.duration))

Currently the maximum trip duration is 11994458 seconds, which is almost 139 days. Something must have gone wrong during the data recording process. The minimum trip duration is negative, which is impossible.
Let's filter the data by duration:
* Trip duration has to be positive and probably less than 8 hours (28800 seconds)

In [None]:
hubway_data_clean = hubway_data[(hubway_data.duration > 0) & (hubway_data.duration < 28800)] 
hubway_data_clean.shape

In [None]:
plt.hist(hubway_data_clean.duration.dropna())
plt.xlabel('Duration')
plt.ylabel('Frequency')

# Filling in missing data with summary statistics
* Impute missing data by replacing it with mean, median or the most frequent value
* Most frequent value could be a good choice for categorical data
* Imputation reduces variability within the dataset, which can affect your model's performance
* Evaluate which imputation technique gives the best performance
* Let's impute the user's age and trip duration using the most frequent value
* We will specify the parameters for the imputer; the [Imputer](http://scikit-learn.org/stable/modules/generated/sklearn.preprocessing.Imputer.html) class linked here has since been replaced in scikit-learn by `sklearn.impute.SimpleImputer`
* Fit the imputer (find most frequent value in this case) and transform the data accordingly

In [None]:
from sklearn.impute import SimpleImputer # replaces the removed preprocessing.Imputer

imp = SimpleImputer(missing_values=np.nan, strategy='most_frequent') # specify the imputer
hubway_data[['age', 'duration']] = imp.fit_transform(hubway_data[['age', 'duration']]) # fit the imputer and transform the input
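To see what most-frequent imputation does, and why it reduces variability, the idea can be reproduced with pandas alone on a tiny made-up frame. This is a sketch of the concept, not scikit-learn's implementation.

```python
import numpy as np
import pandas as pd

# Made-up values with one gap in each column
df = pd.DataFrame({
    "age": [25.0, 31.0, np.nan, 25.0],
    "duration": [600.0, np.nan, 900.0, 600.0],
})

# mode() returns the most frequent value(s) per column; keep the first row
most_frequent = df.mode().iloc[0]
imputed = df.fillna(most_frequent)
print(imputed)

# Filling gaps with a single repeated value pulls the spread down
print(df["age"].std(), imputed["age"].std())
```

Comparing the standard deviation before and after is a quick way to judge how strongly a given imputation strategy flattens a feature.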

# Filling in missing data with k-nearest neighbors (k-NN)
* Fit a model to the data that is not missing
* Use the model to predict the values for missing data
* k-NN finds $k$ samples closest in distance to the missing point and predicts the label from these closest points
* k-NN classification: output is a category decided by majority vote of its $k$ neighbors
* k-NN regression: output is the average of the values of its $k$ nearest neighbors
* The contribution of each neighbor can be weighted by its distance from the point of interest
* Distance metric matters, number of neighbors matters
* Pick parameters that give you best performance on the final task
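The points above can be made concrete with a small k-NN regression written in NumPy. This is an illustrative sketch with a single feature and inverse-distance weights, not scikit-learn's implementation; `knn_regress` and the data are made up.

```python
import numpy as np

def knn_regress(x_train, y_train, x_query, k=3, weighted=True):
    """Predict y at x_query from its k nearest training points (1-D feature)."""
    distances = np.abs(x_train - x_query)
    nearest = np.argsort(distances)[:k]  # indices of the k closest points
    if not weighted:
        return y_train[nearest].mean()   # plain average of the neighbors
    # Inverse-distance weights; the small constant avoids division by zero
    weights = 1.0 / (distances[nearest] + 1e-9)
    return np.sum(weights * y_train[nearest]) / np.sum(weights)

# Made-up trip durations (seconds) and birth years
durations = np.array([300.0, 600.0, 900.0, 1200.0, 5000.0])
birth_years = np.array([1990.0, 1985.0, 1980.0, 1975.0, 1950.0])

uniform_pred = knn_regress(durations, birth_years, 700.0, k=3, weighted=False)
weighted_pred = knn_regress(durations, birth_years, 700.0, k=3, weighted=True)
print(uniform_pred, weighted_pred)
```

The two predictions differ because the weighted version leans toward the closest neighbor. Changing `k` or the distance measure changes the result as well, which is why both are tuned against performance on the final task.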

We will use the duration column to impute missing birth dates.

In [None]:
from sklearn import neighbors # provides KNeighborsRegressor

model = neighbors.KNeighborsRegressor(n_neighbors=10, weights='distance') # 10 neighbors, closer ones count more
model.fit(hubway_data_droppped[['duration']], hubway_data_droppped[['birth_date']])

In [None]:
missing_birth_dates = hubway_data[pd.isnull(hubway_data['birth_date'])]['duration']
imputed_birthdates = model.predict(missing_birth_dates.values.reshape(-1, 1))

In [None]:
plt.hist(imputed_birthdates, bins = 20)
plt.xlabel('Imputed birth date values')
plt.ylabel('Frequency')