Final Project, Part 3: Technical Notebook
Build and document a working model, prototype, recommendation, or solution.

Develop a prototype model or process to successfully resolve the business problem you've chosen. Document your work in a technical notebook that can be shared with your peers.

Build upon your earlier analysis, following the performance metrics you established as part of your problem's evaluation criteria. Demonstrate your approach logically, including all relevant code and data. Polish your notebook for peer audiences by cleanly formatting sections, headers, and descriptions in markdown. Include comments in any code.

Requirements
A detailed Jupyter Notebook with a summary of your analysis, approach, and evaluation metrics.
Clearly formatted structure with section headings and markdown descriptions.
Comments explaining your code.

Note: Here are some things to consider in your notebook: sample size, correlations, feature importance, unexplained variance or outliers, variable selection, train/test comparison, and any relationships between your target and independent variables.
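
As one illustration of the checks listed above, the sketch below reports sample size and ranks features by their correlation with a target. It runs on synthetic data; the column names `feature_a`, `feature_b`, and `target` are hypothetical and do not come from the project dataset.

```python
import numpy as np
import pandas as pd

# Synthetic stand-in data; all column names here are hypothetical.
rng = np.random.default_rng(0)
n = 200
df = pd.DataFrame({
    "feature_a": rng.normal(size=n),
    "feature_b": rng.normal(size=n),
})
# The target depends strongly on feature_a and not at all on feature_b.
df["target"] = 3.0 * df["feature_a"] + rng.normal(scale=0.5, size=n)

# Sample size (rows) and number of columns.
print(df.shape)

# Pearson correlation of each feature with the target, strongest first.
corr_with_target = (
    df.corr()["target"]
    .drop("target")
    .sort_values(key=np.abs, ascending=False)
)
print(corr_with_target)
```

A ranking like this is only a first look: it captures linear relationships and says nothing about interactions between features.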

Submission
Submit or share your project brief as per your instructor's directions.

In [8]:
#load libraries and data
%matplotlib notebook
import matplotlib.pyplot as plt
import numpy as np
import pandas as pd
from sklearn.linear_model import LinearRegression, Lasso, Ridge, ElasticNet
from sklearn.metrics import mean_squared_error
from sklearn.model_selection import train_test_split
import seaborn as sns
from pandas.plotting import scatter_matrix

ow = pd.read_csv('assets/overwatch-diary.csv', index_col = 0)

In [13]:
from sklearn.feature_selection import SelectPercentile

# Numeric columns to impute
columns = ['game_id',
'sr_start',
'sr_finish',
'streak_number',
'my_team_sr',
'enemy_team_sr',
'round',
'capscore',
'score_distance',
'match_length',
'eliminations',
'objective_kills',
'healing',
'deaths',
'weapon_accuracy',
'offensive_assists',
'defensive_assists']

# Fill missing numeric values with each column's median.
# Assigning the result back avoids inplace calls on a column selection,
# which may operate on a copy and leave ow unchanged.
ow[columns] = ow[columns].fillna(ow[columns].median())

In [14]:
ow.dtypes

result                      object
game_id                    float64
sr_start                   float64
sr_finish                  float64
streak_number              float64
my_team_sr                 float64
enemy_team_sr              float64
map                         object
round                      float64
capscore                   float64
score_distance             float64
team_role                   object
match_length               float64
charcter_1                  object
character_2                 object
character_3                 object
psychological_condition     object
eliminations               float64
objective_kills            float64
healing                    float64
deaths                     float64
weapon_accuracy            float64
offensive_assists          float64
defensive_assists          float64
dtype: object

In [17]:
count_nan = len(ow) - ow.count()
count_nan

result                        0
game_id                       0
sr_start                      0
sr_finish                     0
streak_number                 0
my_team_sr                    0
enemy_team_sr                 0
map                           0
round                         0
capscore                      0
score_distance                0
team_role                     6
match_length                  0
charcter_1                   60
character_2                2755
character_3                3209
psychological_condition    1999
eliminations                  0
objective_kills               0
healing                       0
deaths                        0
weapon_accuracy               0
offensive_assists             0
defensive_assists             0
dtype: int64

In [18]:
# scikit-learn expects a 2D feature matrix, so select the column with double brackets.
# Passing the 1D Series ow['sr_start'] raises
# "ValueError: Expected 2D array, got 1D array instead".
X_train, X_test, y_train, y_test = train_test_split(ow[['sr_start']], ow.result, test_size=0.5)

In [20]:
# Keep the top 60% of features ranked by univariate F-score.
# With a single feature the selection is degenerate; pass several columns for a meaningful result.
select = SelectPercentile(percentile=60)
select.fit(X_train, y_train)
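
scikit-learn selectors expect a 2D feature matrix of shape `(n_samples, n_features)`. The sketch below, on synthetic data, contrasts a 1D Series with a single-column DataFrame and then runs `SelectPercentile` over several candidate features. The columns `signal` and `noise_1` to `noise_4` are hypothetical stand-ins, not columns from the diary.

```python
import numpy as np
import pandas as pd
from sklearn.feature_selection import SelectPercentile

# Synthetic stand-in for the match diary; names and values are hypothetical.
rng = np.random.default_rng(0)
n = 300
labels = rng.choice(["Win", "Loss"], size=n)
df = pd.DataFrame({
    # Informative feature: shifted upward for wins.
    "signal": rng.normal(size=n) + (labels == "Win") * 2.0,
    "noise_1": rng.normal(size=n),
    "noise_2": rng.normal(size=n),
    "noise_3": rng.normal(size=n),
    "noise_4": rng.normal(size=n),
})

X_1d = df["signal"]      # Series, shape (n,): rejected by scikit-learn estimators
X_2d = df[["signal"]]    # DataFrame, shape (n, 1): accepted
print(X_1d.shape, X_2d.shape)

# With five candidate features, percentile=60 keeps the three highest F-scores.
select = SelectPercentile(percentile=60)
select.fit(df, labels)
mask = select.get_support()
print(dict(zip(df.columns, mask)))
```

Double brackets (`df[["signal"]]`) keep the result a DataFrame, which is usually simpler than calling `reshape(-1, 1)` on the underlying array.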

In [None]:
#transform training
X_train_selected = select.transform(X_train)

In [None]:
X_train.shape

In [None]:
X_train_selected.shape

In [None]:
mask = select.get_support()

In [None]:
print(mask)

In [None]:
%matplotlib inline
# Visualize the selection mask: black = selected feature, white = dropped
plt.matshow(mask.reshape(1, -1), cmap='gray_r')
plt.xlabel('Feature index')


<center><b>Data Dictionary:</b> https://docs.google.com/document/d/1aYqMaddLpN0Skgwa17HRg-jwqOZ0NVLyWd1-zyTXGro/edit?usp=sharing</center><br><br>

## 1. EXPLORATORY DATA ANALYSIS

In [None]:
#Review columns
ow.head()

## 2. DETERMINE HOW TO HANDLE MISSING VALUES

In [6]:
# Review missing values in each column
count_nan = len(ow) - ow.count()
count_nan

result                        0
game_id                       0
sr_start                      0
sr_finish                     0
streak_number                 0
my_team_sr                    0
enemy_team_sr                 0
map                           0
round                         0
capscore                      0
score_distance                0
team_role                     6
match_length                  0
charcter_1                   60
character_2                2755
character_3                3209
psychological_condition    1999
eliminations                  0
objective_kills               0
healing                       0
deaths                        0
weapon_accuracy               0
offensive_assists             0
defensive_assists             0
dtype: int64
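
Raw counts are easier to act on next to the share of rows they represent. The sketch below, on a toy frame, tabulates both and then labels missing categorical values explicitly. The `"unknown"` placeholder is an assumption for illustration, not a choice made in this notebook, and the values are invented.

```python
import numpy as np
import pandas as pd

# Toy frame; column names echo the diary but the values are made up.
df = pd.DataFrame({
    "team_role": ["tank", None, "support", "dps"],
    "character_2": [None, None, None, "Mercy"],
    "eliminations": [10.0, 14.0, np.nan, 20.0],
})

# Count and share of missing values per column.
missing = pd.DataFrame({
    "n_missing": df.isna().sum(),
    "share_missing": df.isna().mean(),
})
print(missing)

# Hypothetical handling for categorical columns: mark missing entries
# explicitly instead of dropping rows or guessing a category.
cat_cols = df.select_dtypes(exclude="number").columns
df[cat_cols] = df[cat_cols].fillna("unknown")
print(df)
```

A column that is missing in most rows, as `character_2` is here, may be better treated as an optional field than imputed.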

In [7]:
# Median of each numeric feature, used below to fill missing values
ow.median(numeric_only=True)

game_id               145.00
sr_start             2568.00
sr_finish            2568.00
streak_number           1.00
my_team_sr           2655.00
enemy_team_sr        2656.00
round                   2.00
capscore                2.00
score_distance         71.75
match_length            6.00
eliminations           14.00
objective_kills         6.00
healing              5418.00
deaths               9460.00
weapon_accuracy         9.00
offensive_assists      28.00
defensive_assists       6.00
dtype: float64

In [None]:
# Fill any nulls in the numeric (float) features with the feature median.
# The object columns (team_role, the character columns, psychological_condition) are left as-is.
columns = ['game_id',
'sr_start',
'sr_finish',
'streak_number',
'my_team_sr',
'enemy_team_sr',
'round',
'capscore',
'score_distance',
'match_length',
'eliminations',
'objective_kills',
'healing',
'deaths',
'weapon_accuracy',
'offensive_assists',
'defensive_assists']

# Assign the filled frame back; inplace calls on a column selection may act on a copy.
ow[columns] = ow[columns].fillna(ow[columns].median())
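
To make the median fill concrete, here is a minimal sketch on a toy frame with invented values. It selects the numeric columns automatically instead of listing them by hand, which is an assumption about convenience rather than the approach used in this notebook.

```python
import numpy as np
import pandas as pd

# Toy data with one outlier (100.0) to show why the median is a robust fill value.
df = pd.DataFrame({
    "eliminations": [10.0, np.nan, 14.0, 100.0],
    "healing": [5000.0, 5400.0, np.nan, 5600.0],
    "map": ["Ilios", None, "Oasis", "Hanamura"],
})

# Select numeric columns and compute one median per column.
num_cols = df.select_dtypes(include="number").columns
medians = df[num_cols].median()

# fillna with a Series fills each column with its own median.
df[num_cols] = df[num_cols].fillna(medians)
print(df)
```

In this toy frame the missing `eliminations` value becomes 14.0, while a mean fill would use about 41.3 because of the outlier. The non-numeric `map` column is untouched.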

In [None]:
# Confirm the fill: each numeric column's count should now equal len(ow)
ow.describe()