# Imports

Because this notebook compares several classifiers, the import list is fairly long. The complete dataset requires a lot of processing power, so I reduced it to a smaller set of columns that keeps memory usage under 1 GB.

In [1]:
import numpy as np # linear algebra
import pandas as pd # data processing, CSV file I/O (e.g. pd.read_csv)
from sklearn.model_selection import train_test_split

from sklearn.naive_bayes import BernoulliNB
from sklearn.ensemble import RandomForestClassifier
from sklearn.neighbors import KNeighborsClassifier
from sklearn.naive_bayes import GaussianNB
from sklearn.linear_model import Perceptron
from sklearn.linear_model import SGDClassifier
from sklearn.tree import DecisionTreeClassifier
from sklearn.metrics import accuracy_score
import warnings
warnings.filterwarnings('ignore')
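The memory claim above can be checked by measuring a frame directly. The following is a minimal sketch on a tiny synthetic CSV rather than the real Pokemon file. The helper name `memory_mb` is hypothetical, and `usecols` is only one possible way to load fewer columns; the notebook does not show how the columns were actually reduced.

```python
import io
import pandas as pd

def memory_mb(df: pd.DataFrame) -> float:
    """Return the in-memory size of a DataFrame in megabytes (hypothetical helper)."""
    # deep=True also counts the Python strings held in object columns
    return df.memory_usage(deep=True).sum() / 1024 ** 2

# Tiny stand-in for the real CSV; column names mirror a few of the dataset's columns
csv_text = """pokemonId,latitude,longitude,city,temperature
16,20.525745,-97.460829,Mexico_City,25.5
133,20.523695,-97.461167,Mexico_City,25.5
"""

full = pd.read_csv(io.StringIO(csv_text))
# usecols keeps only the listed columns while parsing
slim = pd.read_csv(io.StringIO(csv_text), usecols=["pokemonId", "latitude", "longitude"])

print(f"full: {memory_mb(full):.6f} MB, slim: {memory_mb(slim):.6f} MB")
```

Calling `memory_mb(train)` after each transformation step would show how much the one-hot encoded columns add to the footprint.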

## Data Configuring

The classifiers require every value passed to them to be an integer or a float, so many of the categorical fields have to be converted first. Transforming these fields with `pd.get_dummies` lets the classifiers use all of the data.

In [2]:
train = pd.read_csv('/data/glirios/Pokemon.csv', index_col=0)

In [3]:
train.head()

Unnamed: 0,pokemonId,latitude,longitude,appearedTimeOfDay,appearedDayOfWeek,appearedLocalTime,appearedHour,appearedDay,city,temperature,continent,population_density,class
0,16,20.525745,-97.460829,night,Monday,2016-09-08T03:57:45,5,8,Mexico_City,25.5,America,2431.2341,16
1,133,20.523695,-97.461167,night,Monday,2016-09-08T03:57:37,5,8,Mexico_City,25.5,America,2431.2341,133
2,16,38.90359,-77.19978,night,Monday,2016-09-08T03:57:25,5,8,New_York,24.2,America,761.8856,16
3,13,47.665903,-122.312561,night,Monday,2016-09-08T03:56:22,5,8,Los_Angeles,15.6,America,4842.1626,13
4,133,47.666454,-122.311628,night,Monday,2016-09-08T03:56:08,5,8,Los_Angeles,15.6,America,4842.1626,133


In [4]:
train = train.drop(columns=['appearedDayOfWeek'])

The time columns in the dataset contain some discrepancies, so this cell rebuilds them from `appearedLocalTime` to get the correct values.

In [5]:
#Noticed that the appeared Hour/Minute/Day/Month/Year weren't consistent with appearedLocalTime.
train = train.drop(columns=['appearedHour', 'appearedDay'])

#Convert appearedLocalTime string to DateTime
train['appearedLocalTime'] = pd.to_datetime(train['appearedLocalTime'], format='%Y-%m-%dT%H:%M:%S')
#Now reinstate the appeared Hour/Minute/Day/Month/Year, then drop appearedLocalTime
train['appearedHour'] = train['appearedLocalTime'].dt.hour
train['appearedMinute'] = train['appearedLocalTime'].dt.minute
train['appearedDay'] = train['appearedLocalTime'].dt.day
train['appearedMonth'] = train['appearedLocalTime'].dt.month
train['appearedYear'] = train['appearedLocalTime'].dt.year
train = train.drop(columns=['appearedLocalTime'])
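The inconsistency can be seen in the preview above, where the timestamp reads 03:57 while `appearedHour` is 5. The sketch below copies two of those rows into a small frame, counts the disagreements, and rebuilds the components from the parsed timestamp. The variable names `parsed` and `mismatches` are illustrative only.

```python
import pandas as pd

# Two rows copied from the preview: the timestamp says 03:57 but appearedHour says 5
df = pd.DataFrame({
    "appearedLocalTime": ["2016-09-08T03:57:45", "2016-09-08T03:57:37"],
    "appearedHour": [5, 5],
})

parsed = pd.to_datetime(df["appearedLocalTime"], format="%Y-%m-%dT%H:%M:%S")

# Count rows where the stored hour disagrees with the hour in the timestamp
mismatches = (parsed.dt.hour != df["appearedHour"]).sum()
print(mismatches)

# Rebuild the component columns from the single trusted source
df["appearedHour"] = parsed.dt.hour
df["appearedMinute"] = parsed.dt.minute
```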

Now that the time values are correct, I can pass them to `pd.get_dummies` to get the integer indicator columns that the modeling needs.

In [6]:
#Now use 1-of-K encoding using pd.get_dummies()
#drop_first=True drops one level per feature to avoid the dummy variable trap
Hour = pd.get_dummies(train.appearedHour, drop_first=True, prefix='hour')
Minute = pd.get_dummies(train.appearedMinute, drop_first=True, prefix='minute')
Day = pd.get_dummies(train.appearedDay, drop_first=True, prefix='day')
Month = pd.get_dummies(train.appearedMonth, drop_first=True, prefix='month')
Year = pd.get_dummies(train.appearedYear, drop_first=True, prefix='year')
train = train.join(Hour)
train = train.join(Minute)
train = train.join(Day)
train = train.join(Month)
train = train.join(Year)
#Now we drop the original appearedX features
train = train.drop(columns=['appearedHour', 'appearedMinute', 'appearedDay', 'appearedMonth', 'appearedYear'])
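To illustrate what `drop_first=True` does, here is a small sketch on made-up month values. With every level kept, the indicator columns in each row sum to 1, which is the redundancy known as the dummy variable trap. Dropping the first level removes it. One side effect worth knowing: a column with a single distinct value produces no indicator columns at all.

```python
import pandas as pd

# Made-up values for illustration
df = pd.DataFrame({"appearedMonth": [8, 9, 9, 8]})

all_levels = pd.get_dummies(df["appearedMonth"], prefix="month")
reduced = pd.get_dummies(df["appearedMonth"], prefix="month", drop_first=True)

print(list(all_levels.columns))  # ['month_8', 'month_9']
print(list(reduced.columns))     # ['month_9']

# A feature with only one distinct value leaves nothing after drop_first
single = pd.get_dummies(pd.Series([2016, 2016]), prefix="year", drop_first=True)
print(single.shape[1])           # 0
```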

For `train['appearedTimeOfDay']`, I will only be changing the values to ordinal integers.

In [7]:
#Converting appearedTimeofDay into ordinal
time_mapping = {"morning": 0, "afternoon": 1, "evening": 2, "night": 3}
train['appearedTimeOfDay'] = train['appearedTimeOfDay'].map(time_mapping)
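One thing to watch with `Series.map` and a dictionary is that any label missing from the dictionary becomes `NaN` without raising an error. The sketch below uses the same `time_mapping` with a deliberately made-up label (`"Night"`, capitalised) to show how such values can be detected.

```python
import pandas as pd

time_mapping = {"morning": 0, "afternoon": 1, "evening": 2, "night": 3}
# "Night" (capitalised) is a made-up value used to show an unmapped label
times = pd.Series(["night", "morning", "Night"])

encoded = times.map(time_mapping)

# Labels that were not found in the mapping end up as NaN
unmapped = times[encoded.isna()].unique()
print(encoded.tolist())
print(unmapped)
```

Checking `train['appearedTimeOfDay'].isna().sum()` after the mapping would confirm that every label in the real data was covered.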

Next, the `city` column is converted to dummy variables.

In [8]:
#Get dummies on cities (drop_first=True avoids the dummy variable trap)
cities = pd.get_dummies(train.city, drop_first=True, prefix='city')
train = train.join(cities)
#Now we drop the city feature
train = train.drop(columns=['city'])

After that, I will be changing the `continent` values to better represent the world as a whole. For example, Indiana and Kentucky are both located in America, so it makes sense to replace `America/Indiana` and `America/Kentucky` with `America`.

In [9]:
#redefining continents such that they correspond to the main 7 continents (no Antarctica, yes Indian)
continent_mapping = {
    'America/Indiana': 'America',
    'America/Kentucky': 'America',
    'Pacific': 'Australia',
    'Atlantic': 'Europe',
    'America/Argentina': 'CentralAmerica',
}
#replace on the column itself avoids chained-indexing assignment, which may silently fail
train['continent'] = train['continent'].replace(continent_mapping)
#Then change them to dummies (drop_first=True avoids the dummy variable trap)
Continent = pd.get_dummies(train.continent, drop_first=True, prefix='continent')
train = train.join(Continent)
#Now we drop the continent feature
train = train.drop(columns=['continent'])

Lastly, I will make sure that the `class` and `pokemonId` columns match.

In [10]:
#Making sure that pokemonId (the first column) and class (the last column) are the same
row_ids = train[train['class'] != train.pokemonId].index        #This yields an empty set --> identical columns
#So now drop one of them and keep the other (for now) to use as the labels
#Assign the result back; otherwise class stays in train and leaks the label into the features
train = train.drop(columns=['class'])
train.head(10)

Unnamed: 0,pokemonId,latitude,longitude,appearedTimeOfDay,temperature,population_density,hour_1,hour_2,hour_3,hour_4,...,city_Warsaw,city_Winnipeg,city_Zagreb,city_Zurich,continent_America,continent_Asia,continent_Australia,continent_CentralAmerica,continent_Europe,continent_Indian
0,16,20.525745,-97.460829,3,25.5,2431.234100,0,0,1,0,...,0,0,0,0,1,0,0,0,0,0
1,133,20.523695,-97.461167,3,25.5,2431.234100,0,0,1,0,...,0,0,0,0,1,0,0,0,0,0
2,16,38.903590,-77.199780,3,24.2,761.885600,0,0,1,0,...,0,0,0,0,1,0,0,0,0,0
3,13,47.665903,-122.312561,3,15.6,4842.162600,0,0,1,0,...,0,0,0,0,1,0,0,0,0,0
4,133,47.666454,-122.311628,3,15.6,4842.162600,0,0,1,0,...,0,0,0,0,1,0,0,0,0,0
5,21,-31.954980,115.853609,3,16.5,2102.977500,0,0,1,0,...,0,0,0,0,0,0,1,0,0,0
6,66,-31.954245,115.852038,3,16.5,2102.977500,0,0,1,0,...,0,0,0,0,0,0,1,0,0,0
7,27,26.235257,-98.197591,3,28.0,849.442260,0,0,1,0,...,0,0,0,0,1,0,0,0,0,0
8,35,20.525554,-97.458800,3,25.5,2431.234100,0,0,1,0,...,0,0,0,0,1,0,0,0,0,0
9,19,32.928558,-84.340278,3,23.7,86.498360,0,0,1,0,...,0,0,0,0,1,0,0,0,0,0
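The check and the drop can be illustrated on a three-row frame with made-up values. The sketch shows two things: a boolean test that the two columns agree on every row, and the fact that `DataFrame.drop` returns a new frame, so the column is only removed once the result is assigned back. If a duplicate of the label stays among the features, any classifier can read the answer straight from it.

```python
import pandas as pd

# Made-up rows for illustration
df = pd.DataFrame({
    "pokemonId": [16, 133, 16],
    "latitude": [20.5, 20.5, 38.9],
    "class": [16, 133, 16],
})

# True only when every row agrees
identical = (df["class"] == df["pokemonId"]).all()

# drop returns a new frame; without reassignment df is unchanged
df.drop(columns=["class"])
still_there = "class" in df.columns

# Assigning the result actually removes the column
df = df.drop(columns=["class"])
print(identical, still_there, list(df.columns))
```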


## The Use of Classifiers

In this portion of the notebook, I fit several different classifiers and compare their accuracy to see which one fits our data best.

In [None]:
train_features = train.drop(columns=['pokemonId'])
train_labels = train['pokemonId']
X_train, X_test, Y_train, Y_test = train_test_split(train_features, train_labels, train_size = 0.7, random_state = 46)
X_train.shape, Y_train.shape, X_test.shape
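As a quick illustration of what this split produces, the sketch below runs `train_test_split` with the same `train_size` and `random_state` on synthetic arrays. The data is made up; only the call pattern matches the cell above. Fixing `random_state` makes the split repeatable, which keeps the accuracy scores of the different classifiers comparable.

```python
import numpy as np
from sklearn.model_selection import train_test_split

X = np.arange(200).reshape(100, 2)   # 100 synthetic samples, 2 features
y = np.arange(100) % 4               # 4 synthetic classes

X_train, X_test, y_train, y_test = train_test_split(
    X, y, train_size=0.7, random_state=46
)
print(X_train.shape, X_test.shape)

# Repeating the call with the same seed gives the same split
X_train2, *_ = train_test_split(X, y, train_size=0.7, random_state=46)
print(np.array_equal(X_train, X_train2))
```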

In [None]:
X_train.columns

## Bernoulli

In [None]:
model = BernoulliNB()
model.fit(X_train, Y_train)
Y_pred = model.predict(X_test)
acc_1 = round(accuracy_score(Y_test, Y_pred)*100, 2)
acc_1

## K Neighbor Classifier

In [None]:
model = KNeighborsClassifier(n_neighbors = 3)
model.fit(X_train, Y_train)
Y_pred = model.predict(X_test)
acc_3 = round(accuracy_score(Y_test, Y_pred)*100, 2)

## Gaussian Classifier

In [None]:
model = GaussianNB()
model.fit(X_train, Y_train)
Y_pred = model.predict(X_test)
acc_5 = round(accuracy_score(Y_test, Y_pred)*100, 2)

## Perceptron

In [None]:
model = Perceptron()
model.fit(X_train, Y_train)
Y_pred = model.predict(X_test)
acc_6 = round(accuracy_score(Y_test, Y_pred)*100, 2)

## SGD Classifier

In [None]:
model = SGDClassifier()
model.fit(X_train, Y_train)
Y_pred = model.predict(X_test)
acc_8 = round(accuracy_score(Y_test, Y_pred)*100, 2)

## Decision Tree Classifier

In [None]:
model = DecisionTreeClassifier()
model.fit(X_train, Y_train)
Y_pred = model.predict(X_test)
acc_9 = round(accuracy_score(Y_test, Y_pred)*100, 2)
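The fit, predict, and score steps are identical for every classifier above, so they could also be written as one loop. The following is a sketch under stated assumptions: it runs on synthetic data from `make_classification` rather than the Pokemon features, covers only four of the models, and the names `candidates` and `scores` are illustrative.

```python
from sklearn.datasets import make_classification
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split
from sklearn.naive_bayes import BernoulliNB, GaussianNB
from sklearn.neighbors import KNeighborsClassifier
from sklearn.tree import DecisionTreeClassifier

# Synthetic stand-in data with three classes
X, y = make_classification(
    n_samples=300, n_features=8, n_informative=4, n_classes=3, random_state=46
)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, train_size=0.7, random_state=46
)

candidates = {
    "BernoulliNB": BernoulliNB(),
    "KNeighbors": KNeighborsClassifier(n_neighbors=3),
    "Gaussian": GaussianNB(),
    "Decision Tree": DecisionTreeClassifier(random_state=46),
}

scores = {}
for name, model in candidates.items():
    model.fit(X_train, y_train)
    y_pred = model.predict(X_test)
    # Same metric and rounding as the cells above: accuracy in percent
    scores[name] = round(accuracy_score(y_test, y_pred) * 100, 2)

print(scores)
```

A loop like this keeps each score attached to its model name, which avoids mismatches between variables such as `acc_1` and the labels used later.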

Now that the accuracy scores have been calculated, I place them in a DataFrame for two reasons. The first is to get a clear view of which classifier has the best accuracy. The second is to save the table to a CSV file so that it can be shown during the presentation without spending time rerunning the models.

In [None]:
models = pd.DataFrame({
    'Model' : ['BernoulliNB', 'KNeighbors', 'Gaussian', 'Perceptron', 'Stochastic Gradient Descent', 'Decision Tree'],
    'Accuracy Score' : [acc_1, acc_3, acc_5, acc_6, acc_8, acc_9]
    })
models.sort_values(by='Accuracy Score', ascending=False)

In [None]:
models.to_csv('/data/glirios/poke_models.csv')
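Note that `sort_values` returns a new frame and leaves `models` unchanged, so the CSV written above keeps the original row order. If the ranked order is wanted in the file, the sorted frame can be saved instead. The sketch below uses placeholder scores (not results from this notebook) and an in-memory buffer in place of a file path. `index=False` is an optional choice that leaves the row index out of the file.

```python
import io
import pandas as pd

# Placeholder scores, not results from the notebook
models = pd.DataFrame({
    "Model": ["A", "B", "C"],
    "Accuracy Score": [10.0, 30.0, 20.0],
})

# sort_values returns a new frame; models itself keeps its order
ranked = models.sort_values(by="Accuracy Score", ascending=False)

buffer = io.StringIO()              # in-memory stand-in for a file path
ranked.to_csv(buffer, index=False)  # write the sorted table without the index
buffer.seek(0)
restored = pd.read_csv(buffer)
print(restored)
```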