# Preprocessing Data for Machine Learning

Outline:

* Exploring and Preparing Data
  - Checking Data Types
  - Removing Columns/Rows
  - Splitting Dataset
* Feature Scaling
  - Log normalization
  - Scaling
* Feature Engineering
  - Numerical Features
  - Categorical Features
  - Text
 * Feature Selection
 
 Dataset: UFO sightings

In [1]:
import pandas as pd

import re # Regular Expressions

## Exploring and Preparing Data

In [11]:
# Read in the file
df = pd.read_csv('datasets/ufo.csv')

# Print the first 3 rows
df.head(3)

Unnamed: 0,date,city,state,country,type,seconds,length_of_time,desc,recorded,lat,long
0,11/3/2011 19:21,woodville,wi,us,unknown,1209600.0,2 weeks,Red blinking objects similar to airplanes or s...,12/12/2011,44.9530556,-92.291111
1,10/3/2004 19:05,cleveland,oh,us,circle,30.0,30sec.,Many fighter jets flying towards UFO,10/27/2004,41.4994444,-81.695556
2,9/25/2009 21:00,coon rapids,mn,us,cigar,0.0,,Green&#44 red&#44 and blue pulses of light tha...,12/12/2009,45.12,-93.2875


### Data Types

In [3]:
# Print the data types of the df
df.dtypes

date               object
city               object
state              object
country            object
type               object
seconds           float64
length_of_time     object
desc               object
recorded           object
lat                object
long              float64
dtype: object

In [15]:
# Change the date column to type datetime
df["date"] = pd.to_datetime(df["date"])

pandas.core.series.Series

In [None]:
# Convert the 'lat' column to type int
# df['lat'] = df['lat'].astype(float)

### Removing Columns/Rows

In [13]:
# Print the info for for the df
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 4935 entries, 0 to 4934
Data columns (total 11 columns):
date              4935 non-null datetime64[ns]
city              4926 non-null object
state             4516 non-null object
country           4255 non-null object
type              4776 non-null object
seconds           4935 non-null float64
length_of_time    4792 non-null object
desc              4932 non-null object
recorded          4935 non-null object
lat               4935 non-null object
long              4935 non-null float64
dtypes: datetime64[ns](1), float64(2), object(8)
memory usage: 424.2+ KB


In [16]:
df.type.isnull().head()

0    False
1    False
2    False
3    False
4    False
Name: type, dtype: bool

In [45]:
# Number of missing values in all columns
df.isnull().sum()

date                0
city                9
state             419
country           680
type              159
seconds             0
length_of_time    143
desc                3
recorded            0
lat                 0
long                0
dtype: int64

In [7]:
# Print the number of missing values in the 'city' column
print(df.city.isnull().sum())

# print the shape of the df
print(df.shape)

# Subset the dataset with missing values in the 'city' column and print the shape of the new df
print(df[df.city.isnull()].shape)

# Subset the dataset with non-missing values in the 'city' column and print the shape of the new df
print(df[df.city.notnull()].shape)

9
(4935, 11)
(9, 11)
(4926, 11)


In [32]:
# Keep only rows where length_of_time, state, and type are not null
df2 = df[df.length_of_time.notnull() & 
          df.state.notnull() & 
          df.type.notnull()]

# Print out the shape of the new dataset
df2.shape

(4283, 11)

We can also use the `dropna()` function to remove data, with parameters `axis=0` for rows and `thresh=` for the desired threshold. For instance, when threshold is 8, we'll drop the columns/features which have less than 8 non-missing values (or at least 4 missing values).

In [50]:
df.dropna(axis=0, thresh=8).shape

(4928, 11)

In [52]:
# Number of missing values in all columns
df2.isnull().sum()

date                0
city                0
state               0
country           392
type                0
seconds             0
length_of_time      0
desc                0
recorded            0
lat                 0
long                0
dtype: int64

`length_of_time` column has text from which we can extranct minute values.

In [56]:
# Print the head of the length_of_time column
df2.length_of_time.head()

0            2 weeks
1             30sec.
3    about 5 minutes
4                  2
5         10 minutes
Name: length_of_time, dtype: object

In [90]:
# import re

def return_minutes(time_string):

    # Use \d+ to grab digits
    pattern = re.compile(r"([\d+).min")
    
    # Use match on the pattern and column
    num = re.match(pattern, time_string)
    if num is not None:
        return int(num.group(1)) # return the first phranthesized subgroup
        
# Apply the extraction to the length_of_time column
df2["minutes"] = df2["length_of_time"].apply(lambda row: return_minutes(row))

# Print the head of both of the columns
df2[['length_of_time', 'minutes']].head()

IndexError: no such group

## Feature Scaling

![Big and Small](https://eazybi.com/static/img/blog_page/posts/2015_12_14/small_vs_big.jpg "Big and Small")

We may need to standardize the range of independent variables (or features) in order to increase predicton accuracy, especially when the range of values of different features varies significantly. Standardization is required for certain Machine Learning algorithms (such as K-nearest neighbors and SVM) to work properly.

$$ z = {x - \mu \over \sigma} $$

In [None]:
from sklearn.neighbors import KNeighborsClassifier

knn = KNeighborsClassifier(n_neighbors=7)
knn.fit(X_train, y_train)

### Log Normalization

In [None]:
# Import StandardScaler from scikit-learn
from sklearn.preprocessing import StandardScaler

## Feature Engineering

### Numerical Features

### Categorical Features

When a variable/feature has two categories, we can use `LabelEncoder()` to encode it into 1s and 0s. When it has more than two categories, then we can perform one-hot encoding with `get_dummies()`.

In [None]:
from sklearn.preprocessing import LabelEncoder


### Text

### Splitting Dataset

In [53]:
df2.describe()

Unnamed: 0,seconds,long
count,4283.0,4283.0
mean,5309.563,-94.406454
std,124306.5,20.297169
min,0.0,-170.478889
25%,20.0,-113.993056
50%,180.0,-88.987778
75%,600.0,-80.118357
max,6312000.0,117.897392


In [None]:
df_X = df.drop(['xx'], axis = 1).values # features
df_y = df.xx.values

In [None]:
from sklearn.model_selection import train_test_split

X_train, X_test, y_train, y_test = train_test_split(X, y)


#### Stratified Sampling

If the distribution of a column is uneven, we may use stratifying in order to train the model on a sample which is representative of the dataset

In [None]:
y["labels"].value_counts()

In [None]:
X_train, X_test, y_train, y_test = train_test_split(X, y, stratify=y)