# UFO Sighting and Data Preprocessing - DataCamp

This notebook walks through a data cleaning workflow on a UFO sighting dataset. It covers several data cleaning techniques and shows how to prepare training and test sets for model creation. With the clean data we will predict whether a given UFO sighting took place in the USA or in Canada.

<img src="Images/ufo.jpg" />

In [1]:
#Import required libraries
import pandas as pd
import numpy as np

#Loading the dataset, found on DataCamp website
ufo = pd.read_csv('PATH TO CSV')

#First pass look at the data
ufo.head(10)

Unnamed: 0,date,city,state,country,type,seconds,length_of_time,desc,recorded,lat,long
0,11/3/2011 19:21,woodville,wi,us,unknown,1209600.0,2 weeks,Red blinking objects similar to airplanes or s...,12/12/2011,44.9530556,-92.291111
1,10/3/2004 19:05,cleveland,oh,us,circle,30.0,30sec.,Many fighter jets flying towards UFO,10/27/2004,41.4994444,-81.695556
2,9/25/2009 21:00,coon rapids,mn,us,cigar,0.0,,Green&#44 red&#44 and blue pulses of light tha...,12/12/2009,45.12,-93.2875
3,11/21/2002 05:45,clemmons,nc,us,triangle,300.0,about 5 minutes,It was a large&#44 triangular shaped flying ob...,12/23/2002,36.0213889,-80.382222
4,8/19/2010 12:55,calgary (canada),ab,ca,oval,0.0,2,A white spinning disc in the shape of an oval.,8/24/2010,51.083333,-114.083333
5,6/16/2012 23:00,san diego,ca,us,light,600.0,10 minutes,Dancing lights that would fly around and then ...,7/4/2012,32.7152778,-117.156389
6,7/12/2009 21:30,duluth,mn,us,oval,600.0,total? maybe around 10 mi,A minor amber color trail&#44 (from where we w...,3/13/2012,46.7833333,-92.106389
7,10/20/2008 18:30,fairfield,tx,us,other,0.0,several sightings from 10,Multiple sightings in Central Texas (Freestone...,1/10/2009,31.7244444,-96.165
8,6/9/2013 00:00,oakville (canada),on,ca,light,120.0,2 minutes,Brilliant orange light or chinese lantern at o...,7/3/2013,43.433333,-79.666667
9,4/26/2013 23:27,lacey,wa,us,light,120.0,2 minutes,Bright red light moving north to north west fr...,5/15/2013,47.0344444,-122.821944


The table contains datetime, categorical, text and numeric data. A lot of cleaning is needed to create a usable dataset. First, let's look at the data types.

In [2]:
# Check the column types
ufo.dtypes

date               object
city               object
state              object
country            object
type               object
seconds           float64
length_of_time     object
desc               object
recorded           object
lat                object
long              float64
dtype: object

## Changing the Data Types

The 'date' feature in the DataFrame is of type 'object'. We should change this to 'datetime' for easier access to the data it contains.

In [3]:
# Change the date column to type datetime
ufo["date"] = pd.to_datetime(ufo['date'])

In [4]:
# Check the column types
ufo[['date']].dtypes

date    datetime64[ns]
dtype: object
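
A minimal sketch of the same conversion on toy data, assuming the dates follow the month/day/year hour:minute layout seen in the table above. The `format` and `errors="coerce"` arguments are optional extras that the original cell does not use, and the sample values (including the invalid entries) are made up for illustration. The second half shows how `pd.to_numeric` could handle a numeric column stored as strings, such as 'lat', which is still of type 'object'.

```python
import pandas as pd

# Toy data: two dates in the dataset's layout plus one invalid entry (made up)
raw_dates = pd.Series(["11/3/2011 19:21", "10/3/2004 19:05", "not a date"])

# An explicit format avoids day/month ambiguity;
# errors="coerce" turns unparseable values into NaT instead of raising
parsed = pd.to_datetime(raw_dates, format="%m/%d/%Y %H:%M", errors="coerce")
print(parsed)

# pd.to_numeric behaves the same way for numbers stored as strings
raw_lat = pd.Series(["44.9530556", "41.4994444", "oops"])
lat = pd.to_numeric(raw_lat, errors="coerce")
print(lat)
```

Coercing bad values to `NaT` or `NaN` lets the missing-value steps later in the notebook deal with them.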

<img src="Images/cantina.jpg" />

## Looking for and Removing Rows with Missing Values


Now that the 'date' column has the type we want (the 'lat' column is still of type 'object', but it is dropped before modelling), we should look for missing values in the dataset.

In [5]:
#Looking for missing values
ufo.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 4935 entries, 0 to 4934
Data columns (total 11 columns):
date              4935 non-null datetime64[ns]
city              4926 non-null object
state             4516 non-null object
country           4255 non-null object
type              4776 non-null object
seconds           4935 non-null float64
length_of_time    4792 non-null object
desc              4932 non-null object
recorded          4935 non-null object
lat               4935 non-null object
long              4935 non-null float64
dtypes: datetime64[ns](1), float64(2), object(8)
memory usage: 424.2+ KB


Of the 11 features, only 5 have no missing values, with 4935 entries each. Here we will keep only the rows where we know the length of time the UFO was seen, the state it was in and what type of UFO it was, and remove the rest.

In [6]:
# Keep only rows where length_of_time, state, and type are not null
ufo = ufo[ufo['length_of_time'].notnull() & 
          ufo['state'].notnull() & 
          ufo['type'].notnull()]
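
As a sketch of an equivalent approach, `DataFrame.dropna` with its `subset` argument gives the same result as the boolean mask. The toy frame below is made up for illustration and only mimics the relevant columns.

```python
import pandas as pd

# Toy frame with gaps in the three columns of interest (values are made up)
toy = pd.DataFrame({
    "length_of_time": ["2 weeks", None, "30sec.", "5 minutes"],
    "state": ["wi", "oh", None, "nc"],
    "type": ["unknown", "circle", "cigar", None],
    "country": ["us", "us", None, "us"],
})

# Boolean-mask version, as in the cell above
masked = toy[toy["length_of_time"].notnull()
             & toy["state"].notnull()
             & toy["type"].notnull()]

# Equivalent one-liner: drop rows with a null in any of the listed columns
dropped = toy.dropna(subset=["length_of_time", "state", "type"])

print(masked.equals(dropped))
print(len(dropped))
```

Only the listed columns are checked, so a row with a missing 'country' would survive this step if its other three fields were present.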

As the length of time is recorded in different units, we will create a new column that contains the units, and then remove all rows that contain a null value (including rows where no unit was found).

In [7]:
#Creating a column that contains the units from the 'length_of_time' column
ufo.loc[ufo['length_of_time'].str.contains('sec'), 'time_units'] = 'seconds'
ufo.loc[ufo['length_of_time'].str.contains('min'), 'time_units'] = 'minutes'
ufo.loc[ufo['length_of_time'].str.contains('week'), 'time_units'] = 'weeks'

ufo.head()

Unnamed: 0,date,city,state,country,type,seconds,length_of_time,desc,recorded,lat,long,time_units
0,2011-11-03 19:21:00,woodville,wi,us,unknown,1209600.0,2 weeks,Red blinking objects similar to airplanes or s...,12/12/2011,44.9530556,-92.291111,weeks
1,2004-10-03 19:05:00,cleveland,oh,us,circle,30.0,30sec.,Many fighter jets flying towards UFO,10/27/2004,41.4994444,-81.695556,seconds
3,2002-11-21 05:45:00,clemmons,nc,us,triangle,300.0,about 5 minutes,It was a large&#44 triangular shaped flying ob...,12/23/2002,36.0213889,-80.382222,minutes
4,2010-08-19 12:55:00,calgary (canada),ab,ca,oval,0.0,2,A white spinning disc in the shape of an oval.,8/24/2010,51.083333,-114.083333,
5,2012-06-16 23:00:00,san diego,ca,us,light,600.0,10 minutes,Dancing lights that would fly around and then ...,7/4/2012,32.7152778,-117.156389,minutes


In [8]:
# Drop every row that still has a missing value in any column
ufo = ufo.dropna()

ufo.shape

(3406, 12)

## Gaining Time Values from The Time Column

The 'length_of_time' column contains free-form strings. In this step we extract the numeric time value from each entry using Python's regular expression module `re`; more on regular expressions can be found [here](https://www.w3schools.com/python/python_regex.asp).

In [9]:
#import regular expression operations
import re

#Defining a function that returns the leading number in the string
def return_time_values(time_string):

    # Use \d+ to grab digits
    pattern = re.compile(r"\d+")
    
    # re.match only matches at the start of the string, so entries such as
    # 'about 5 minutes' return None
    num = re.match(pattern, time_string)
    if num is not None:
        return int(num.group(0))
        
# Apply the extraction to the length_of_time column
ufo["time_values"] = ufo["length_of_time"].apply(return_time_values)

In [10]:
# Take a look at the head of both of the columns
print(ufo[['length_of_time', 'time_values', 'time_units']].head(10))

     length_of_time  time_values time_units
0           2 weeks          2.0      weeks
1            30sec.         30.0    seconds
3   about 5 minutes          NaN    minutes
5        10 minutes         10.0    minutes
8         2 minutes          2.0    minutes
9         2 minutes          2.0    minutes
10        5 minutes          5.0    minutes
11       10 minutes         10.0    minutes
12            2 min          2.0    minutes
13       30 seconds         30.0    seconds
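
The `NaN` for 'about 5 minutes' comes from `re.match`, which only looks for the pattern at the start of the string. The sketch below contrasts it with `re.search`, which scans the whole string. The helper names `leading_number` and `first_number` are hypothetical, and the notebook's results further down were produced with the `re.match` behaviour.

```python
import re

def leading_number(text):
    # re.match anchors at the start of the string (the notebook's approach)
    found = re.match(r"\d+", text)
    return int(found.group(0)) if found else None

def first_number(text):
    # re.search scans the whole string for the first run of digits
    found = re.search(r"\d+", text)
    return int(found.group(0)) if found else None

for text in ["2 weeks", "30sec.", "about 5 minutes"]:
    print(text, leading_number(text), first_number(text))
```

Switching to `re.search` would keep more rows, but it would also change the row counts reported in the rest of the notebook.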


Now that we have the time values, we can convert them to minutes based on their unit.

In [11]:
#Creating the minutes column
ufo.loc[ufo['time_units'] == 'seconds', 'minutes'] = ufo['time_values']/60
ufo.loc[ufo['time_units'] == 'minutes', 'minutes'] = ufo['time_values']
ufo.loc[ufo['time_units'] == 'weeks', 'minutes'] = ufo['time_values']*10080 

#Removing rows with NaN values (entries where no leading number was found)
ufo = ufo.dropna()

ufo.shape

(3148, 14)
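
The same conversion can be sketched with a lookup table instead of one `.loc` assignment per unit. This is an alternative formulation, not the notebook's code. The toy values and the `seconds_per_unit` name are made up for illustration.

```python
import pandas as pd

# Toy data mirroring the time_values / time_units columns (values are made up)
toy = pd.DataFrame({
    "time_values": [30, 10, 2],
    "time_units": ["seconds", "minutes", "weeks"],
})

# Seconds per unit: 1 week = 7 * 24 * 60 * 60 = 604800 seconds
seconds_per_unit = {"seconds": 1, "minutes": 60, "weeks": 604800}

# Convert everything to seconds first, then to minutes
toy["minutes"] = toy["time_values"] * toy["time_units"].map(seconds_per_unit) / 60
print(toy["minutes"].tolist())  # [0.5, 10.0, 20160.0]
```

Units missing from the dictionary map to `NaN`, which mirrors how unmatched rows end up as `NaN` in the original cell.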

<img src="Images/waiting.png" />

## Applying Log Normalisation to Features with Large Variance

Here we can apply a normalisation to some features. Let's look at the variance of two of our features: the given 'seconds' feature and the recently created 'minutes' column.

In [12]:
# Check the variance of the seconds and minutes columns
print(ufo[['seconds', 'minutes']].var())

seconds    1.509768e+09
minutes    4.193635e+05
dtype: float64


Here we can see that the variance is very large in both the seconds and the minutes columns, which makes sense as reported sighting durations range from seconds up to weeks. To make these features more ML friendly, we apply a log normalisation to them.

In [13]:
# Log normalize the seconds and minutes columns
ufo["seconds_log"] = np.log(ufo['seconds'])
ufo["minutes_log"] = np.log(ufo['minutes'])

#Replacing any infinite values with NaN (np.log(0) evaluates to -inf)
ufo["seconds_log"] = ufo['seconds_log'].replace([np.inf, -np.inf], np.nan)
ufo["minutes_log"] = ufo['minutes_log'].replace([np.inf, -np.inf], np.nan)


In [14]:
#Removing the null values 
ufo = ufo[pd.notnull(ufo['seconds_log'])]
ufo = ufo[pd.notnull(ufo['minutes_log'])]

In [15]:
# Print out the variance of the normalised columns
print(ufo[["seconds_log", 'minutes_log']].var())

seconds_log    3.831132
minutes_log    3.885088
dtype: float64


Here we see a significant decrease in the variance of both features. More information about scaling can be found [here](https://developers.google.com/machine-learning/data-prep/transform/normalization).
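
A small sketch of the effect on toy data, with made-up durations that include a zero. It shows why infinite values appear after `np.log` and how the variance shrinks. The `np.log1p` line at the end is an alternative that keeps zero-length sightings. It is an assumption added here, not the notebook's choice.

```python
import numpy as np
import pandas as pd

# Toy durations in seconds, from zero up to two weeks (values are made up)
seconds = pd.Series([0.0, 30.0, 120.0, 600.0, 1209600.0])

# np.log(0) evaluates to -inf with a RuntimeWarning; silence it and use NaN
with np.errstate(divide="ignore"):
    logged = np.log(seconds).replace([np.inf, -np.inf], np.nan)

print(logged.isna().sum())           # 1
print(seconds.var() > logged.var())  # True

# Alternative (assumption): log(1 + x) maps zero to zero instead of -inf
logged_1p = np.log1p(seconds)
print(logged_1p.iloc[0])             # 0.0
```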

## Using Encoders on Categorical Features

Next we will turn categorical features into numerical ones using an encoder. First we will encode the country as 1 for the USA or 0 for Canada, and then one-hot encode the type of UFO seen.

In [16]:
# Use Pandas to encode us values as 1 and others as 0
ufo["country_enc"] = ufo["country"].apply(lambda x: 1 if x=='us' else 0)

In [17]:
# Print the number of unique type values
print(len(ufo['type'].unique()))

# Create a one-hot encoded set of the type values
type_set = pd.get_dummies(ufo['type'])

# Concatenate this set back to the ufo DataFrame
ufo = pd.concat([ufo, type_set], axis=1)

21


21 columns have been added to the DataFrame, one for each 'type' of UFO seen. More information on feature encoding can be found [here](https://towardsdatascience.com/encoding-categorical-features-21a2651a065c).
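
Both encodings can be illustrated on a three-row toy frame (values made up). The binary encoding here uses a boolean comparison cast to `int`, which is equivalent to the `apply` with a lambda used above.

```python
import pandas as pd

toy = pd.DataFrame({
    "country": ["us", "ca", "us"],
    "type": ["light", "oval", "light"],
})

# Binary encoding: 1 for 'us', 0 otherwise
toy["country_enc"] = (toy["country"] == "us").astype(int)

# One-hot encoding: one column per distinct value of 'type'
type_set = pd.get_dummies(toy["type"])
toy = pd.concat([toy, type_set], axis=1)

print(toy["country_enc"].tolist())  # [1, 0, 1]
print(list(type_set.columns))       # ['light', 'oval']
```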

## Converting Time to Months and Years Columns

Now we will split the datetime column into separate month and year features. This can be done quite easily with Python, and is explained [here](http://docs.python.org/2/library/datetime.html#datetime.datetime.strptime).

In [18]:
# Look at the first 5 rows of the date column
print(ufo['date'].head())

0   2011-11-03 19:21:00
1   2004-10-03 19:05:00
5   2012-06-16 23:00:00
8   2013-06-09 00:00:00
9   2013-04-26 23:27:00
Name: date, dtype: datetime64[ns]


In [19]:
# Extract the month from the date column
ufo["month"] = ufo["date"].apply(lambda row: row.month)

# Extract the year from the date column
ufo["year"] = ufo["date"].apply(lambda row: row.year)

# Take a look at the head of all three columns
print(ufo[['date', 'month', 'year']].head())

                 date  month  year
0 2011-11-03 19:21:00     11  2011
1 2004-10-03 19:05:00     10  2004
5 2012-06-16 23:00:00      6  2012
8 2013-06-09 00:00:00      6  2013
9 2013-04-26 23:27:00      4  2013
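
For a datetime column, pandas also offers the vectorised `.dt` accessor, which gives the same result as applying a lambda row by row. A minimal sketch on two of the dates shown above:

```python
import pandas as pd

dates = pd.Series(pd.to_datetime(["2011-11-03 19:21:00", "2012-06-16 23:00:00"]))

# Vectorised attribute access through the .dt accessor
months = dates.dt.month
years = dates.dt.year

print(months.tolist())  # [11, 6]
print(years.tolist())   # [2011, 2012]
```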


<img src="Images/alienpad.png" />

## Dropping Features we Don't Need

Now that the features are all clean it's time to decide which features the model will be built on. For the first step we'll assess the correlations between the seconds and minutes columns.

In [20]:
# Check the correlation between the seconds, seconds_log, minutes, and minutes_log columns
print(ufo[['seconds', 'seconds_log', 'minutes', 'minutes_log']].corr())

              seconds  seconds_log   minutes  minutes_log
seconds      1.000000     0.131936  0.999965     0.131503
seconds_log  0.131936     1.000000  0.130681     0.988823
minutes      0.999965     0.130681  1.000000     0.131115
minutes_log  0.131503     0.988823  0.131115     1.000000


Here we can see a strong correlation between seconds and minutes, and likewise between their log-scaled versions. This makes sense as they're measurements of the same thing. In this case we can drop all but one time feature. Next we can look for other columns to drop.
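
One way to spot such redundant pairs programmatically is to scan the correlation matrix for values above a cut-off. The sketch below uses toy data, and the `threshold` of 0.95 is a hypothetical choice rather than something taken from the notebook.

```python
import pandas as pd

# Toy data: minutes is an exact rescaling of seconds, month is unrelated
toy = pd.DataFrame({
    "seconds": [30.0, 120.0, 600.0, 1209600.0],
    "month": [11, 10, 6, 4],
})
toy["minutes"] = toy["seconds"] / 60

corr = toy.corr().abs()
threshold = 0.95  # hypothetical cut-off

# Collect each pair of columns once (upper triangle of the matrix)
pairs = [(a, b)
         for i, a in enumerate(corr.columns)
         for b in corr.columns[i + 1:]
         if corr.loc[a, b] > threshold]
print(pairs)
```

Each flagged pair is a candidate for keeping just one of its two columns.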

In [21]:
#Analysing the columns we have in the dataframe
ufo.columns

Index(['date', 'city', 'state', 'country', 'type', 'seconds', 'length_of_time',
       'desc', 'recorded', 'lat', 'long', 'time_units', 'time_values',
       'minutes', 'seconds_log', 'minutes_log', 'country_enc', 'changing',
       'chevron', 'cigar', 'circle', 'cone', 'cross', 'cylinder', 'diamond',
       'disk', 'egg', 'fireball', 'flash', 'formation', 'light', 'other',
       'oval', 'rectangle', 'sphere', 'teardrop', 'triangle', 'unknown',
       'month', 'year'],
      dtype='object')

In [22]:
# Make a list of features to drop
to_drop = ['date', 'city', 'country', 'state', 'type', 'seconds', 'length_of_time', 'desc', 'recorded', 'lat', 'long', 'time_units', 'time_values', 'minutes', 'seconds_log']

# Drop those features
ufo_dropped = ufo.drop(to_drop, axis=1)

ufo_dropped.columns

Index(['minutes_log', 'country_enc', 'changing', 'chevron', 'cigar', 'circle',
       'cone', 'cross', 'cylinder', 'diamond', 'disk', 'egg', 'fireball',
       'flash', 'formation', 'light', 'other', 'oval', 'rectangle', 'sphere',
       'teardrop', 'triangle', 'unknown', 'month', 'year'],
      dtype='object')

Here we can see all the features that remain after removing those we deem unnecessary for the model. From these we can create the model. In this case a K-Nearest-Neighbors classifier was selected.

<img src="Images/trump-ufo.jpg" />

## Running a KNN Model

In [23]:
#Subsetting features
X = ufo_dropped[['minutes_log', 'changing', 'chevron', 'cigar', 'circle',
       'cone', 'cross', 'cylinder', 'diamond', 'disk', 'egg', 'fireball',
       'flash', 'formation', 'light', 'other', 'oval', 'rectangle', 'sphere',
       'teardrop', 'triangle', 'unknown', 'month', 'year']]

#Subsetting the target variable
y = ufo_dropped['country_enc']

Now that we have separated the features from the target, we can create a model that classifies whether a UFO was spotted in the USA or in Canada.

In [24]:
from sklearn.model_selection import train_test_split

# Split the X and y sets using train_test_split, setting stratify=y
train_X, test_X, train_y, test_y = train_test_split(X, y, stratify=y)
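
To illustrate what `stratify=y` does, the sketch below splits a made-up imbalanced target and checks that both parts keep the same class proportions. The `random_state` argument is an addition for reproducibility and is not part of the original cell.

```python
import pandas as pd
from sklearn.model_selection import train_test_split

# Imbalanced toy target: 80 ones and 20 zeros (made-up data)
y_toy = pd.Series([1] * 80 + [0] * 20)
X_toy = pd.DataFrame({"feature": range(100)})

# Default test size is 25%; stratify keeps the 80/20 ratio in both parts
toy_train_X, toy_test_X, toy_train_y, toy_test_y = train_test_split(
    X_toy, y_toy, stratify=y_toy, random_state=0)

print(toy_train_y.mean(), toy_test_y.mean())  # 0.8 0.8
```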

In [25]:
from sklearn.neighbors import KNeighborsClassifier

knn = KNeighborsClassifier()

# Fit knn to the training sets
knn.fit(train_X, train_y)

# Print the score of knn on the test sets
print(knn.score(test_X, test_y))

0.9491740787801779


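The model reached an accuracy of about 95% on the test set. Accuracy alone can be flattering when one class dominates, so this figure is best read alongside the share of the majority class in the data. All data and workflow were based on the Preprocessing for Machine Learning in Python course on DataCamp.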
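
A quick way to put an accuracy score in context is a majority-class baseline: the accuracy of always predicting the most frequent class. The sketch below uses a made-up 90/10 target. That split is hypothetical and is not the actual class ratio of the UFO data.

```python
import pandas as pd

# Made-up target with the same encoding as country_enc (1 = us, 0 = ca)
y_toy = pd.Series([1] * 90 + [0] * 10)

# Accuracy obtained by always predicting the most frequent class
baseline_accuracy = y_toy.value_counts(normalize=True).max()
print(baseline_accuracy)  # 0.9
```

Running the same line on `test_y` from the cells above would give the corresponding baseline for the real split.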