# 400_prep_RQ1_Dataframe
## Purpose 
In this notebook we prepare a dataframe for our Research Question 1.  
## Datasets 
* _Input_: Joined1617.csv,Joined1516.csv,Joined1415.csv,Joined1314.csv,Joined1213.csv
* _Output_: RQ1.csv

In [1]:
import math
import os.path
import numpy as np
import pandas as pd

### Reading in our cleaned Joined datasets from 16-17 to 12-13.

In [2]:
DF1 = pd.read_csv("../../data/prep/Joined1617.csv")
DF2 = pd.read_csv("../../data/prep/Joined1516.csv")
DF3 = pd.read_csv("../../data/prep/Joined1415.csv")
DF4 = pd.read_csv("../../data/prep/Joined1314.csv")
DF5 = pd.read_csv("../../data/prep/Joined1213.csv")
DF6 = pd.read_csv("../../data/prep/Joined1718.csv")

## Choosing which columns are needed for the Research Question
* index refers to all columns that have a fixed figure in FIFA.

In [3]:
index = ['Players','club','league','age','nationality','Position','overall','photo','injury_prone_trait',"fan's_favourite_trait"]

### Concatting each years using index and Points & Apps

In [4]:
s1 = DF1.drop_duplicates(index).set_index(index)[['Points','Apps']]
s2 = DF2.drop_duplicates(index).set_index(index)[['Points','Apps']]
s3 = DF3.drop_duplicates(index).set_index(index)[['Points','Apps']]
s4 = DF4.drop_duplicates(index).set_index(index)[['Points','Apps']]
s5 = DF5.drop_duplicates(index).set_index(index)[['Points','Apps']]
s6 = DF6.drop_duplicates(index).set_index(index)[['Form','Apps']]

RQ1 = pd.concat([s1,s2,s3,s4,s5,s6], axis=1, keys=('16/17','15/16','14/15','13/14','12/13','17/18')).fillna(0).astype(float).reset_index()

## Filtering
**For reproducibility purposes**
* Removes players with no appearances.

In [5]:
RQ1 = RQ1.loc[~((RQ1['16/17']['Apps'] == 0) & (RQ1['15/16']['Apps'] == 0) & (RQ1['14/15']['Apps'] == 0) & (RQ1['13/14']['Apps'] == 0) & (RQ1['12/13']['Apps'] == 0) & (RQ1['17/18']['Apps'] == 0))]

### Creating a new column 'Valid Seasons'
* This checks if a player has played more than 9 games in a season if true it adds one to a count column.
* If the total of columns is greater than 2 seasons they are included in the research question.
* This is used because to truly judge a player accurately you need atleast two seasons of information 

In [6]:
RQ1['count1'] = 0
RQ1.loc[RQ1['16/17']['Apps']>9.0, 'count1'] = 1
RQ1['count2'] = 0
RQ1.loc[RQ1['15/16']['Apps']>9.0, 'count2'] = 1
RQ1['count3'] = 0
RQ1.loc[RQ1['14/15']['Apps']>9.0, 'count3'] = 1
RQ1['count4'] = 0
RQ1.loc[RQ1['13/14']['Apps']>9.0, 'count4'] = 1
RQ1['count5'] = 0
RQ1.loc[RQ1['12/13']['Apps']>9.0, 'count5'] = 1

RQ1['Valid Seasons'] = RQ1['count1'] + RQ1['count2'] + RQ1['count3'] + RQ1['count4'] + RQ1['count5']

RQ1 = RQ1[RQ1['Valid Seasons'] >= 2] 

## Creating New Columns
* **Overall Points** - contains the total number of points a player has had.<br><br>
* **Overall Apps** - contains the total number of appearances a player has made.<br><br>
* **Seasons Played** - is the total number of Seasons a player has played.<br><br>

* **Average Apps/Season** - is the Overall Pointed divided by Overall Apps.<br><br>

* **Current Form** - is the current form of a player in the 17/18 season.<br><br>

* **Apps This Season** -  is the total appearances a player has made this season.<br><br>

In [7]:
RQ1['Overall Points'] = RQ1['16/17']['Points']+RQ1['15/16']['Points']+RQ1['14/15']['Points']+RQ1['13/14']['Points']+RQ1['12/13']['Points']

In [8]:
RQ1['Overall Apps'] = RQ1['16/17']['Apps']+RQ1['15/16']['Apps']+RQ1['14/15']['Apps']+RQ1['13/14']['Apps']+RQ1['12/13']['Apps']

In [9]:
RQ1['count6'] = 0
RQ1.loc[RQ1['16/17']['Apps']>0.0, 'count6'] = 1
RQ1['count7'] = 0
RQ1.loc[RQ1['15/16']['Apps']>0.0, 'count7'] = 1
RQ1['count8'] = 0
RQ1.loc[RQ1['14/15']['Apps']>0.0, 'count8'] = 1
RQ1['count9'] = 0
RQ1.loc[RQ1['13/14']['Apps']>0.0, 'count9'] = 1
RQ1['count10'] = 0
RQ1.loc[RQ1['12/13']['Apps']>0.0, 'count1='] = 1

RQ1['Seasons Played'] = RQ1['count6'] + RQ1['count7'] + RQ1['count8'] + RQ1['count9'] + RQ1['count10']

In [10]:
RQ1['Average Apps/Season'] = RQ1['Overall Apps']/RQ1['Seasons Played']

In [11]:
RQ1['Previous Average Form'] = RQ1['Overall Points']/RQ1['Overall Apps']

In [12]:
RQ1['Current Average Form'] = RQ1['17/18']['Form']

In [13]:
RQ1['Apps This Season'] = RQ1['17/18']['Apps']

## Tidying Up
* Choosing all the relevant columns to the questions we wish to ask.

In [14]:
RQ1 = RQ1[['Players','Position','club','league','age','nationality','overall','photo','injury_prone_trait',"fan's_favourite_trait",'Average Apps/Season','Previous Average Form','Apps This Season','Current Average Form']]
RQ1.head(5)

Unnamed: 0,Players,Position,club,league,age,nationality,overall,photo,injury_prone_trait,fan's_favourite_trait,Average Apps/Season,Previous Average Form,Apps This Season,Current Average Form
,,,,,,,,,,,,,,
0.0,Aaron Cresswell,Defender,West Ham United,English Premier League,27.0,England,76.0,https://cdn.sofifa.org/18/players/189615.png,False,False,33.666667,2.821782,34.0,2.5
1.0,Aaron Hughes,Defender,Heart of Midlothian,Scottish Premiership,37.0,Northern Ireland,71.0,https://cdn.sofifa.org/18/players/17725.png,True,False,37.0,1.216216,0.0,0.0
2.0,Aaron Hunt,Midfielder,Hamburger SV,German Bundesliga,30.0,Germany,76.0,https://cdn.sofifa.org/18/players/158138.png,True,False,29.5,3.050847,28.0,2.428571
3.0,Aaron Lennon,Midfielder,Everton,English Premier League,30.0,England,77.0,https://cdn.sofifa.org/18/players/152747.png,False,False,30.0,2.633333,27.0,1.888889
6.0,Aaron Niguez,Midfielder,Real Oviedo,Spanish Segunda Division,28.0,Spain,73.0,https://cdn.sofifa.org/18/players/183853.png,False,False,21.5,1.511628,0.0,0.0


In [15]:
RQ1.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 1791 entries, 0 to 4322
Data columns (total 14 columns):
(Players, )                  1791 non-null object
(Position, )                 1791 non-null object
(club, )                     1791 non-null object
(league, )                   1791 non-null object
(age, )                      1791 non-null int64
(nationality, )              1791 non-null object
(overall, )                  1791 non-null int64
(photo, )                    1791 non-null object
(injury_prone_trait, )       1791 non-null bool
(fan's_favourite_trait, )    1791 non-null bool
(Average Apps/Season, )      1791 non-null float64
(Previous Average Form, )    1791 non-null float64
(Apps This Season, )         1791 non-null float64
(Current Average Form, )     1791 non-null float64
dtypes: bool(2), float64(4), int64(2), object(6)
memory usage: 185.4+ KB


#### Saving to csv file in data/analysis

In [16]:
RQ1.to_csv('../../data/analysis/RQ1.csv')