<center>
<h1>Assignment: Data Preprocessing and Modeling</h1>
<hr>
<h2>UFO Sighting Data Exploration</h2>
<hr>
</center>

## 1. Import dataset "ufo_sightings_large.csv" in pandas (5 points)

In [1]:
import pandas as pd
import numpy as np

In [2]:
ufo = pd.read_csv('ufo_sightings_large.csv')
ufo.head()

Unnamed: 0,date,city,state,country,type,seconds,length_of_time,desc,recorded,lat,long
0,11/3/2011 19:21,woodville,wi,us,unknown,1209600.0,2 weeks,Red blinking objects similar to airplanes or s...,12/12/2011,44.9530556,-92.291111
1,10/3/2004 19:05,cleveland,oh,us,circle,30.0,30sec.,Many fighter jets flying towards UFO,10/27/2004,41.4994444,-81.695556
2,9/25/2009 21:00,coon rapids,mn,us,cigar,0.0,,Green&#44 red&#44 and blue pulses of light tha...,12/12/2009,45.12,-93.2875
3,11/21/2002 05:45,clemmons,nc,us,triangle,300.0,about 5 minutes,It was a large&#44 triangular shaped flying ob...,12/23/2002,36.0213889,-80.382222
4,8/19/2010 12:55,calgary (canada),ab,ca,oval,0.0,2,A white spinning disc in the shape of an oval.,8/24/2010,51.083333,-114.083333


## 2. Checking column types & Converting Column types (10 points)
Inspect the UFO dataset's column types using the dtypes attribute, then convert each column to its proper type.
For example, the date column can be converted to the datetime type,
which will make our feature engineering efforts easier later on.

In [3]:
# Check type of column inputs
print(ufo.dtypes)

# Convert the seconds column to float
ufo['seconds'] = ufo['seconds'].astype(float)

# Convert the date column to datetime
ufo['date'] = pd.to_datetime(ufo['date'])

# Confirm the new types of the seconds and date columns
print(ufo[['seconds', 'date']].dtypes)

date               object
city               object
state              object
country            object
type               object
seconds           float64
length_of_time     object
desc               object
recorded           object
lat                object
long              float64
dtype: object
seconds           float64
date       datetime64[ns]
dtype: object
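
The conversions above raise an error if a value cannot be parsed. As a minimal sketch, assuming a small hypothetical frame named `raw` (not part of the assignment data), unparseable entries can instead be coerced to missing values:

```python
import pandas as pd

# Hypothetical toy data; the column names mirror the UFO dataset
raw = pd.DataFrame({
    'date': ['11/3/2011 19:21', 'not a date'],
    'seconds': ['30.0', 'unknown'],
})

# errors='coerce' turns unparseable entries into NaT / NaN instead of raising
raw['date'] = pd.to_datetime(raw['date'], errors='coerce')
raw['seconds'] = pd.to_numeric(raw['seconds'], errors='coerce')

print(raw.dtypes)
print(raw.isnull().sum())
```

Coerced values then show up in the missing-value counts, so they can be handled together with the other nulls.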


## 3. Dropping missing data (10 points)
Let's remove some of the rows where certain columns have missing values. 

In [4]:
# Check missing values in the length_of_time, state, and type columns

print(ufo[['length_of_time', 'state', 'type']].isnull().sum())

# Keep only non-null rows in selected columns
ufo_no_missing = ufo[ufo['length_of_time'].notnull() &
                     ufo['state'].notnull() & 
                     ufo['type'].notnull()]

# Print new shape of the refined dataset
print(ufo_no_missing.shape)

length_of_time    143
state             419
type              159
dtype: int64
(4283, 11)
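
The boolean mask above can also be written with `dropna` and its `subset` argument. A small sketch on a hypothetical toy frame (the values are made up) suggests the two forms select the same rows:

```python
import numpy as np
import pandas as pd

# Hypothetical toy frame with gaps in the three columns of interest
toy = pd.DataFrame({
    'length_of_time': ['5 minutes', np.nan, '2 minutes', '1 hour', '10 sec'],
    'state': ['wi', 'oh', np.nan, 'mn', 'nc'],
    'type': ['light', 'circle', 'oval', np.nan, 'disk'],
    'city': [np.nan, 'b', 'c', 'd', 'e'],
})

# Boolean-mask version, as in the cell above
masked = toy[toy['length_of_time'].notnull() &
             toy['state'].notnull() &
             toy['type'].notnull()]

# Drop rows with a null in any of the listed columns; other columns are ignored
dropped = toy.dropna(subset=['length_of_time', 'state', 'type'])

print(masked.equals(dropped))
print(dropped.shape)
```

Row 0 survives even though its city is missing, because city is not listed in `subset`.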


## 4. Extracting numbers from strings (10 points)
The <b>length_of_time</b> column in the UFO dataset is a text field that has the number of 
minutes within the string. 
Here, you'll extract that number from that text field using regular expressions.

In [5]:
import re
import math

ufo = pd.read_csv('ufo_sample.csv')

# Convert the seconds column to float
ufo['seconds'] = ufo['seconds'].astype(float)

# Convert the date column to datetime
ufo['date'] = pd.to_datetime(ufo['date'])

def return_minutes(time_string):
    # \d+ matches one or more consecutive digits
    pattern = re.compile(r'\d+')
    # re.match only matches at the start of the string, so a value such as
    # 'about 5 minutes' returns None and ends up as NaN in the new column
    num = re.match(pattern, time_string)
    if num is not None:
        return int(num.group(0))

# Extract the numerical data from length_of_time
ufo['minutes'] = ufo['length_of_time'].apply(return_minutes)

# Check the head of both columns again
print(ufo[['length_of_time', 'minutes']].head(10))

    length_of_time  minutes
0  about 5 minutes      NaN
1       10 minutes     10.0
2        2 minutes      2.0
3        2 minutes      2.0
4        5 minutes      5.0
5       10 minutes     10.0
6        5 minutes      5.0
7        5 minutes      5.0
8        5 minutes      5.0
9          1minute      1.0
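
Row 0 in the output above is NaN because `re.match` only looks at the start of the string. One possible variant, sketched here under the hypothetical name `minutes_anywhere`, uses `re.search` to find the first run of digits anywhere in the text and returns NaN for non-string input:

```python
import math
import re

def minutes_anywhere(time_string):
    # Hypothetical variant of return_minutes: tolerate missing (non-string) values
    if not isinstance(time_string, str):
        return math.nan
    # re.search scans the whole string rather than only its start
    found = re.search(r'\d+', time_string)
    return int(found.group(0)) if found else math.nan

print(minutes_anywhere('about 5 minutes'))
print(minutes_anywhere('1minute'))
print(minutes_anywhere(None))
```

Like the original function, this sketch ignores the unit, so a value such as '2 hours' would still yield 2.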


## 5. Identifying features for standardization (10 points)
In this section, you'll investigate the variance of columns in the UFO dataset to 
determine which features should be standardized. You can log-normalize the high-variance column.

In [6]:
# Check the variance of the seconds and minutes columns
print(ufo[['seconds', 'minutes']].var())

# Log-normalize the seconds column
ufo['seconds_log'] = np.log(ufo['seconds'])

# Print the variance of the new seconds_log column
print(ufo['seconds_log'].var())

seconds    424087.417474
minutes       117.546372
dtype: float64
1.1223923881183004
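
`np.log` returns `-inf` for a value of 0, and the larger file shown in section 1 contains rows where seconds is 0.0. As a hedged alternative (not what the cell above uses), `np.log1p` computes log(1 + x) and keeps such rows finite. The durations below are hypothetical:

```python
import numpy as np
import pandas as pd

# Hypothetical durations in seconds, including a zero
seconds = pd.Series([0.0, 30.0, 300.0, 1209600.0])

# log1p maps a zero duration to 0 rather than -inf
seconds_log1p = np.log1p(seconds)

print(seconds.var())
print(seconds_log1p.var())
```

The transformed column has a far smaller variance than the raw one, which is the effect the log normalization is after.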


## 6. Encoding categorical variables (20 points)
There are a couple of columns in the UFO dataset that need to be encoded before they can be 
modeled through scikit-learn. 
You'll do that transformation here, <b>using both binary and one-hot encoding methods</b>.

In [7]:
# Encode 'us' as 1 and all other values as 0
ufo['country_enc'] = ufo['country'].apply(lambda x: 1 if x == 'us' else 0)

# Print the number of unique values in the type column
print(len(ufo['type'].unique()))

# Use get_dummies to one-hot encode the type column
type_set = pd.get_dummies(ufo['type'])

# Concatenate the encoded columns onto the original ufo DataFrame
ufo = pd.concat([ufo, type_set], axis=1)

21
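
Both encodings can be seen on a small scale with a hypothetical toy frame. The sketch below uses a vectorized comparison for the binary column and passes `dtype=int` to `get_dummies`, since pandas 2.0 and later return boolean dummy columns by default:

```python
import pandas as pd

# Hypothetical toy data with the same column names as the UFO dataset
toy = pd.DataFrame({
    'country': ['us', 'ca', 'us', 'gb'],
    'type': ['light', 'oval', 'light', 'disk'],
})

# Binary encoding: 1 for 'us', 0 for everything else
toy['country_enc'] = (toy['country'] == 'us').astype(int)

# One-hot encoding: one 0/1 column per distinct type value
type_set = pd.get_dummies(toy['type'], dtype=int)

toy = pd.concat([toy, type_set], axis=1)
print(toy)
```

Each row has exactly one 1 across the one-hot columns, which are named after the distinct type values.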


In [8]:
# Check the type of the date column
print(ufo['date'].dtypes)

# Extract the month from the date column
ufo['month'] = ufo['date'].apply(lambda date: date.month)

# Extract the year from the date column
ufo['year'] = ufo['date'].apply(lambda date: date.year)

# Check the head of all three columns
print(ufo[['date', 'month', 'year']].head())

datetime64[ns]
                 date  month  year
0 2002-11-21 05:45:00     11  2002
1 2012-06-16 23:00:00      6  2012
2 2013-06-09 00:00:00      6  2013
3 2013-04-26 23:27:00      4  2013
4 2013-09-13 20:30:00      9  2013
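
For a datetime column, the `.dt` accessor gives the same components without a per-row lambda. A brief sketch on two hypothetical timestamps:

```python
import pandas as pd

# Two hypothetical timestamps in a datetime Series
dates = pd.Series(pd.to_datetime(['2002-11-21 05:45', '2012-06-16 23:00']))

# The .dt accessor extracts components for the whole column at once
months = dates.dt.month
years = dates.dt.year

print(months.tolist(), years.tolist())
```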


## 7. Text vectorization (10 points)
Let's transform the <b>desc</b> column in the UFO dataset into TF-IDF vectors, 
since there's likely something we can learn from this field.

In [9]:
from sklearn.feature_extraction.text import TfidfVectorizer

print(ufo['desc'].head()) # Look at the head of the desc field

vec = TfidfVectorizer()   # Create the TF-IDF vectorizer object

desc_tfidf = vec.fit_transform(ufo['desc']) # Fit the vectorizer on desc and transform it

print(desc_tfidf.shape) # Check the shape of the resulting matrix: (documents, distinct tokens)

0    It was a large&#44 triangular shaped flying ob...
1    Dancing lights that would fly around and then ...
2    Brilliant orange light or chinese lantern at o...
3    Bright red light moving north to north west fr...
4    North-east moving south-west. First 7 or so li...
Name: desc, dtype: object
(1866, 3422)
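
To make the shape above easier to read, here is a small sketch on three hypothetical descriptions. The result has one row per document and one column per distinct token, and `vocabulary_` maps each token to its column index:

```python
from sklearn.feature_extraction.text import TfidfVectorizer

# Hypothetical descriptions standing in for the desc column
docs = [
    'bright red light moving north',
    'bright orange light',
    'triangular object hovering',
]

toy_vec = TfidfVectorizer()
toy_tfidf = toy_vec.fit_transform(docs)

# One row per document, one column per distinct token
print(toy_tfidf.shape)

# Tokens listed in column order
print(sorted(toy_vec.vocabulary_, key=toy_vec.vocabulary_.get))
```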


## 8. Selecting the ideal dataset (10 points)
Let's get rid of some of the unnecessary features. 

In [15]:
def return_weights(vocab, original_vocab, vector, vector_index, top_n):
    # Pair each nonzero column index in this row with its tf-idf weight
    zipped = dict(zip(vector[vector_index].indices, vector[vector_index].data))

    # Transform the zipped dict into a Series keyed by word
    zipped_series = pd.Series({vocab[i]: zipped[i] for i in vector[vector_index].indices})

    # Sort the Series to pull out the top_n highest-weighted words
    zipped_index = zipped_series.sort_values(ascending=False)[:top_n].index
    return [original_vocab[i] for i in zipped_index]

def words_to_filter(vocab, original_vocab, vector, top_n):
    filter_list = []
    for i in range(0, vector.shape[0]):
        # Call return_weights for each row and extend the list we're creating
        try:
            filtered = return_weights(vocab, original_vocab, vector, i, top_n)
            filter_list.extend(filtered)
        except Exception:
            # Skip rows for which the lookup fails
            pass
    return set(filter_list)
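
The lookup that `return_weights` performs for a single row can be traced on a toy corpus. This sketch assumes `vocab` maps column index to word and `original_vocab` maps word to column index; the names `index_to_word` and `word_to_index` are hypothetical stand-ins for those two dictionaries:

```python
import pandas as pd
from sklearn.feature_extraction.text import TfidfVectorizer

# Hypothetical two-document corpus
docs = ['bright red light moving north', 'bright orange light']
toy_vec = TfidfVectorizer()
toy_tfidf = toy_vec.fit_transform(docs)

# Assumed lookups: index -> word and word -> index
index_to_word = {idx: word for word, idx in toy_vec.vocabulary_.items()}
word_to_index = toy_vec.vocabulary_

# Weights of the nonzero entries in row 0, keyed by word
row = toy_tfidf[0]
weights = pd.Series({index_to_word[i]: w for i, w in zip(row.indices, row.data)})

# Keep the two highest-weighted words and map them back to column indices
top_words = weights.sort_values(ascending=False).head(2).index
top_columns = [word_to_index[w] for w in top_words]

print(list(top_words), top_columns)
```

Words that occur in only one document ('red', 'moving', 'north') outweigh the shared words 'bright' and 'light', so the top two come from that group.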

In [16]:
vocab_csv = pd.read_csv('vocab_ufo.csv', index_col=0).to_dict()
vocab = vocab_csv['0']

In [17]:
# Check the correlation between the seconds, seconds_log, and minutes columns
print(ufo[['seconds', 'seconds_log', 'minutes']].corr())

# Define a list of features to drop
to_drop = ['city', 'country', 'date', 'desc', 'lat', 
           'length_of_time', 'seconds', 'minutes', 'long', 'state', 'recorded']

# Drop those features listed above
ufo_dropped = ufo.drop(to_drop, axis=1)

# Let's also filter some words out of the text vector we created
filtered_words = words_to_filter(vocab, vec.vocabulary_, desc_tfidf, top_n=4)

              seconds  seconds_log   minutes
seconds      1.000000     0.853371  0.980341
seconds_log  0.853371     1.000000  0.824493
minutes      0.980341     0.824493  1.000000


In [18]:
ufo_dropped

Unnamed: 0,type,seconds_log,country_enc,changing,chevron,cigar,circle,cone,cross,cylinder,...,light,other,oval,rectangle,sphere,teardrop,triangle,unknown,month,year
0,triangle,5.703782,1,0,0,0,0,0,0,0,...,0,0,0,0,0,0,1,0,11,2002
1,light,6.396930,1,0,0,0,0,0,0,0,...,1,0,0,0,0,0,0,0,6,2012
2,light,4.787492,0,0,0,0,0,0,0,0,...,1,0,0,0,0,0,0,0,6,2013
3,light,4.787492,1,0,0,0,0,0,0,0,...,1,0,0,0,0,0,0,0,4,2013
4,sphere,5.703782,1,0,0,0,0,0,0,0,...,0,0,0,0,1,0,0,0,9,2013
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
1861,unknown,7.901007,1,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,1,8,2002
1862,oval,5.703782,1,0,0,0,0,0,0,0,...,0,0,1,0,0,0,0,0,7,2013
1863,changing,5.192957,1,1,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,11,2008
1864,circle,5.192957,1,0,0,0,1,0,0,0,...,0,0,0,0,0,0,0,0,6,1998


In [19]:
X = ufo_dropped.drop(['type', 'country_enc'], axis=1)
y = ufo_dropped['country_enc']

## 9. Split the X and y using train_test_split, setting stratify = y (5 points)

In [20]:
print(X.columns)

Index(['seconds_log', 'changing', 'chevron', 'cigar', 'circle', 'cone',
       'cross', 'cylinder', 'diamond', 'disk', 'egg', 'fireball', 'flash',
       'formation', 'light', 'other', 'oval', 'rectangle', 'sphere',
       'teardrop', 'triangle', 'unknown', 'month', 'year'],
      dtype='object')


In [21]:
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier

knn = KNeighborsClassifier()

# Split X and y sets 
train_X, test_X, train_y, test_y = train_test_split(X, y, stratify=y)

# Fit knn to the training sets
knn.fit(train_X, train_y)

# Print the score of knn on the test sets
print(knn.score(test_X, test_y))

0.8779443254817987
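
The effect of `stratify` is easiest to see on imbalanced toy labels. In the sketch below, the data and the `test_size` and `random_state` values are illustrative assumptions; fixing `random_state` also makes the split repeatable, which the cell above does not do:

```python
import pandas as pd
from sklearn.model_selection import train_test_split

# Hypothetical data: 80 positive and 20 negative labels
X_toy = pd.DataFrame({'feature': range(100)})
y_toy = pd.Series([1] * 80 + [0] * 20)

# Stratifying on y keeps the class proportions the same in both parts
X_tr, X_te, y_tr, y_te = train_test_split(
    X_toy, y_toy, stratify=y_toy, test_size=0.25, random_state=0
)

print(y_tr.mean(), y_te.mean())
```

Both parts keep the 80/20 class ratio of the full label set.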


## 10. Fit knn to the training sets and print the score of knn on the test sets

In [22]:
y = ufo_dropped['type']

In [23]:
from sklearn.naive_bayes import GaussianNB

nb = GaussianNB()

# Use the list of filtered words to filter the text vector as follows
filtered_text = desc_tfidf[:, list(filtered_words)]

# Split X and y sets 
train_X, test_X, train_y, test_y = train_test_split(filtered_text.toarray(), y, stratify=y)

# Fit nb to the training sets
nb.fit(train_X, train_y)

# Print the score of nb on the test sets
print(nb.score(test_X, test_y))

0.22055674518201285
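
A multi-class accuracy like the one above is easier to judge next to a trivial baseline, such as always predicting the most frequent class. A minimal sketch with hypothetical labels standing in for the type column:

```python
import pandas as pd

# Hypothetical label column standing in for ufo_dropped['type']
labels = pd.Series(['light'] * 5 + ['circle'] * 3 + ['triangle'] * 2)

# Accuracy obtained by always predicting the most frequent class
baseline_accuracy = labels.value_counts(normalize=True).max()

print(baseline_accuracy)
```

Running the same expression on the real labels would give the figure to compare the classifier's score against.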
