# Sprint 4 Project
## Exploratory Data Analysis
In this notebook, I perform exploratory data analysis on a chess games dataset retrieved from Kaggle.com. I will reuse the findings and code from this notebook to build and deploy a web app that shares the analysis with others.

### Importing libraries and data

In [1]:
# import libraries
import pandas as pd
import streamlit as st
import plotly.express as px

# load the data into a DataFrame using pandas
games_df = pd.read_csv('../games.csv')
games_df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 20058 entries, 0 to 20057
Data columns (total 16 columns):
 #   Column          Non-Null Count  Dtype  
---  ------          --------------  -----  
 0   id              20058 non-null  object 
 1   rated           20058 non-null  bool   
 2   created_at      20058 non-null  float64
 3   last_move_at    20058 non-null  float64
 4   turns           20058 non-null  int64  
 5   victory_status  20058 non-null  object 
 6   winner          20058 non-null  object 
 7   increment_code  20058 non-null  object 
 8   white_id        20058 non-null  object 
 9   white_rating    20058 non-null  int64  
 10  black_id        20058 non-null  object 
 11  black_rating    20058 non-null  int64  
 12  moves           20058 non-null  object 
 13  opening_eco     20058 non-null  object 
 14  opening_name    20058 non-null  object 
 15  opening_ply     20058 non-null  int64  
dtypes: bool(1), float64(2), int64(4), object(9)
memory usage: 2.3+ MB


There are no missing values. Next, check for duplicate rows.
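The non-null counts reported by `info()` can also be confirmed with an explicit count. This is a minimal sketch on made-up data; `toy_df` is a hypothetical stand-in for `games_df`, and one value is deliberately left missing so the check has something to find.

```python
import pandas as pd

# toy_df is a hypothetical stand-in for games_df (values are made up)
toy_df = pd.DataFrame({
    "id": ["a1", "b2", "c3"],
    "turns": [42, 13, 78],
    "winner": ["white", None, "black"],
})

# count missing values per column; all zeros would mean nothing is missing
missing_per_column = toy_df.isna().sum()
print(missing_per_column)

# a single yes/no answer for the whole frame
has_missing = bool(toy_df.isna().any().any())
print(has_missing)
```

On the real data, the same two lines applied to `games_df` should report zero for every column if the `info()` summary above is right.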

In [2]:
# checking for duplicate rows
games_df.duplicated().sum()

429

There are 429 duplicate rows, so further investigation is required.

In [3]:
# filter for duplicated rows and review them
duplicate_games = games_df[games_df.duplicated()]
duplicate_games.sample(n=10)


Unnamed: 0,id,rated,created_at,last_move_at,turns,victory_status,winner,increment_code,white_id,white_rating,black_id,black_rating,moves,opening_eco,opening_name,opening_ply
11269,WRCq0dJs,True,1396617000000.0,1396618000000.0,51,resign,white,7+8,rubi,1686,vladimir-kramnik-1,1567,e4 d6 Nf3 e5 d4 Nc6 d5 Nce7 c4 Bg4 Be2 Ng6 Nc3...,C41,Philidor Defense,5
4271,vwkSIpZ0,True,1474470000000.0,1474470000000.0,75,mate,white,7+2,kasparovd0311,1473,chees29,1221,e4 e5 Nf3 Nc6 Bb5 a6 Ba4 Nf6 O-O Be7 Re1 O-O B...,C84,Ruy Lopez: Closed Variations,10
11271,3YW2jdCY,True,1396531000000.0,1396531000000.0,45,resign,white,10+10,patolino,1519,vladimir-kramnik-1,1597,e4 e5 d3 Nf6 Bg5 Be7 Qc1 d5 f3 Nc6 c3 h6 Bh4 O...,C20,King's Pawn Game: Leonardis Variation,3
11215,3CB6Onyb,True,1446093000000.0,1446094000000.0,113,mate,white,10+0,tori92,2084,sunsettosunrise,2066,e4 c5 Nf3 d6 d4 cxd4 Nxd4 Nf6 Nc3 g6 Bb5+ Nbd7...,B70,Sicilian Defense: Dragon Variation,10
14763,yZwB2tve,True,1504338000000.0,1504339000000.0,81,resign,black,5+5,lugg1,1141,a_p_t_e_m_u_u,1369,e4 e5 f4 d6 Bc4 Nf6 Nc3 Nxe4 Nxe4 d5 Bb5+ c6 B...,C30,King's Gambit,3
11305,UkIkOmsC,False,1395856000000.0,1395856000000.0,33,resign,white,7+2,seyen,1624,great-mate-super-gre,1620,e4 e5 d4 exd4 c3 dxc3 Bc4 cxb2 Bxb2 Nc6 Qf3 Nf...,C21,Danish Gambit Accepted,9
11301,cHhT3SPI,True,1395944000000.0,1395946000000.0,58,mate,black,15+60,romanelli,1387,great-mate-super-gre,1739,d4 d5 g3 Nf6 Bg2 Nc6 c4 dxc4 Bg5 h6 Bxf6 exf6 ...,D00,Queen's Pawn Game,2
14520,XNMbvz8v,True,1504169000000.0,1504169000000.0,70,mate,black,10+0,superhero098,1997,miqomanukyan2001,1480,e4 e5 d4 exd4 Qxd4 Nc6 Qe3 b6 Bd3 Bc5 Nc3 Bxe3...,C22,Center Game: Paulsen Attack Variation,7
2554,JrFxlGZo,False,1489330000000.0,1489330000000.0,78,mate,black,15+0,sp37578,1309,penguin100,1393,e4 d5 exd5 Qxd5 Nc3 Qe6+ Qe2 Qc6 d3 Bf5 Qf3 Be...,B01,Scandinavian Defense: Mieses-Kotroc Variation,4
13597,TaJPxLiw,True,1502371000000.0,1502372000000.0,32,mate,black,15+3,srn3000,1336,kylarr,1275,c4 Nf6 Nc3 g6 g3 Bg7 Bg2 h5 d4 h4 gxh4 Rxh4 e3...,A16,English Opening: Anglo-Indian Defense | Queen...,3


In [4]:
# checking that they are true duplicates:
for game_id in duplicate_games['id'].sample(n=10):
    print(games_df[games_df['id'] == game_id])

             id  rated    created_at  last_move_at  turns victory_status  \
11331  awZ3R17W   True  1.492935e+12  1.492936e+12     42         resign   
13602  awZ3R17W   True  1.492935e+12  1.492936e+12     42         resign   

      winner increment_code        white_id  white_rating black_id  \
11331  white          12+10  dagobertzeiler          1333   kylarr   
13602  white          12+10  dagobertzeiler          1333   kylarr   

       black_rating                                              moves  \
11331          1347  e4 c6 Bc4 d5 exd5 cxd5 Bb3 Nf6 Nc3 Bf5 Nf3 e6 ...   
13602          1347  e4 c6 Bc4 d5 exd5 cxd5 Bb3 Nf6 Nc3 Bf5 Nf3 e6 ...   

      opening_eco                         opening_name  opening_ply  
11331         B00  Caro-Kann Defense: Hillbilly Attack            3  
13602         B00  Caro-Kann Defense: Hillbilly Attack            3  
             id  rated    created_at  last_move_at  turns victory_status  \
10504  glk9SiIG   True  1.446189e+12  1.446190e+12 

The sampled rows are exact copies of each other in every column, so the duplicates can be safely dropped.
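Sampling ten ids is a spot check. A more exhaustive check compares the number of full-row duplicates with the number of repeated ids. The sketch below uses made-up data; `toy_df` is a hypothetical stand-in for `games_df`.

```python
import pandas as pd

# toy_df is a hypothetical stand-in for games_df (values are made up)
toy_df = pd.DataFrame({
    "id": ["a1", "b2", "a1", "c3", "b2"],
    "turns": [42, 13, 42, 78, 13],
    "winner": ["white", "black", "white", "draw", "black"],
})

# keep=False marks every copy of a duplicated row, not just the later ones
all_copies = toy_df[toy_df.duplicated(keep=False)].sort_values("id")
print(all_copies)

# rows flagged by the default full-row check
full_row_dupes = int(toy_df.duplicated().sum())

# rows whose id alone repeats; if this matches the count above,
# no repeated id hides a row that differs in another column
id_dupes = int(toy_df.duplicated(subset="id").sum())
print(full_row_dupes, id_dupes)
```

If the two counts differed on the real data, some games would share an id while differing elsewhere, and dropping by id alone would lose information.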

In [5]:
# drop duplicates
games_df.drop_duplicates(inplace=True)
games_df.duplicated().sum()

0

With the duplicates gone, the dataset is ready for some EDA.

### Enhance the data

Let's begin by adding some columns:

To enhance the data:

* **"rating_difference"** (int) - the difference in rating between players, using "white_rating" and "black_rating"
* **"avg_rating"** (float) - the average rating between players, using "white_rating" and "black_rating"

To better group the data:

* **"opening"** (string) - a more generalized opening, using "opening_name"
* **"skill_level"** (string) - the category of the players' rating ("beginner", "intermediate", "advanced", "master"), using "avg_rating"
* **"time_control"** (string) - the category of the game's base time ("bullet", "blitz", "rapid", "classic"), using the minutes before the "+" in "increment_code"

In [6]:
# add rating_difference column by determining difference between white's and black's ratings
games_df['rating_difference'] = games_df['white_rating'] - games_df['black_rating']

# add avg_rating column by determining average between white's and black's ratings
games_df['avg_rating'] = (games_df['white_rating'] + games_df['black_rating']) / 2

# define a function to reduce the complexity of the opening name
def get_opening(opening_name):
    ''' This function takes a full opening name and returns a shortened, more general version of it
    
        parameters:
        opening_name (string) - a value from the "opening_name" column of the games_df dataframe

        returns:
        opening (string) - the generalized name of the opening played (the text before the first colon)
        '''    
    return opening_name.split(':')[0]

# define a function to determine skill_level by grouping rating scores into categories
def get_skill_level(rating):
    ''' This function takes a rating and returns a classifying string
    
        parameters:
        rating (number) - a value from the "avg_rating" column of the games_df dataframe
        
        returns:
        skill_level (string) - the rating classification for players in the game
    '''
    if rating <= 1400:
        return 'beginner'
    elif 1400 < rating <= 1600:
        return 'intermediate'
    elif 1600 < rating <= 1800:
        return 'advanced'
    else:
        return 'master'

# define a function to determine time control by grouping time increments into categories
def get_time_control(increment):
    ''' This function takes an increment code and returns a classifying string
    
        parameters:
        increment (string) - a value from the "increment_code" column, formatted as "minutes+increment"
        
        returns:
        time_control (string) - the time control type for the game, based on the base minutes
    '''
    time = int(increment.split('+')[0])

    if time < 5:
        return 'bullet'
    elif 5 <= time < 10:
        return 'blitz'
    elif 10 <= time < 30:
        return 'rapid'
    else:
        return 'classic'
    
# create new columns by applying the above functions    
games_df['opening'] = games_df['opening_name'].apply(get_opening)
games_df['skill_level'] = games_df['avg_rating'].apply(get_skill_level)
games_df['time_control'] = games_df['increment_code'].apply(get_time_control)
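
The same groupings can be expressed without row-by-row `apply` calls by using `pd.cut`. This is only a sketch of an alternative, not the approach used in this notebook; `toy_df` and `base_minutes` are hypothetical names, and the bin edges are copied from the functions above.

```python
import numpy as np
import pandas as pd

# toy_df is a hypothetical stand-in for games_df (values are made up)
toy_df = pd.DataFrame({
    "avg_rating": [1250.0, 1400.0, 1500.5, 1800.0, 2100.0],
    "increment_code": ["3+2", "5+5", "10+0", "15+15", "30+0"],
})

# right-closed bins reproduce: <=1400, (1400, 1600], (1600, 1800], >1800
toy_df["skill_level"] = pd.cut(
    toy_df["avg_rating"],
    bins=[-np.inf, 1400, 1600, 1800, np.inf],
    labels=["beginner", "intermediate", "advanced", "master"],
)

# base minutes are the part of increment_code before the '+'
base_minutes = toy_df["increment_code"].str.split("+", n=1, expand=True)[0].astype(int)

# left-closed bins reproduce: <5, [5, 10), [10, 30), >=30
toy_df["time_control"] = pd.cut(
    base_minutes,
    bins=[-np.inf, 5, 10, 30, np.inf],
    labels=["bullet", "blitz", "rapid", "classic"],
    right=False,
)
print(toy_df)
```

One difference to keep in mind: `pd.cut` returns an ordered categorical column rather than plain strings, which keeps the categories in a sensible order when plotting.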

### Exploring the data

In [7]:
st.header('Total turns played in a game vs. Average rating of the players')

# generate a scatter plot of average player rating vs. total turns, colored by time control
px.scatter(games_df, x='avg_rating', y='turns', color='time_control', title='Chess Game Total Turns vs. Average Player Rating',
    labels=dict(avg_rating='Average Player Rating', turns='Total Turns', time_control='Time Control'), opacity=0.75)


In [8]:
# observe most frequent openings
top_openings = games_df['opening'].value_counts()[:20]
print(top_openings)

Sicilian Defense           2527
French Defense             1273
Queen's Pawn Game          1039
Italian Game                953
King's Pawn Game            899
Ruy Lopez                   833
English Opening             715
Scandinavian Defense        707
Caro-Kann Defense           581
Scotch Game                 455
Queen's Gambit Declined     384
Van't Kruijs Opening        363
Four Knights Game           353
Zukertort Opening           324
Bishop's Opening            322
Indian Game                 316
Philidor Defense            267
Queen's Gambit Refused      248
Queen's Gambit Accepted     248
Russian Game                247
Name: opening, dtype: int64


In [9]:
st.header('Game End Conditions')

# show histogram of game outcomes, colored by skill level
px.histogram(games_df, x='victory_status', color='skill_level', title='Game End Conditions',
    labels=dict(victory_status='End Condition', skill_level='Skill Level'))


In [10]:
st.header('Rating difference')
st.write('(positive values indicate white is higher rated)')

# create a checkbox that lets the user filter for games where 'rated' is True
rated_only = st.checkbox("Include only rated games")

# start from the full data and narrow it down only if the user chooses to do so
filter_df = games_df
if rated_only:
    filter_df = games_df[games_df['rated']]

# generate a histogram of the rating difference on the filtered data
fig = px.histogram(filter_df, x='rating_difference', nbins=250, range_x=[-500,500])

# use streamlit to display fig
st.write(fig)
fig
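
The histogram shows the shape of the distribution; summary statistics per group make the rated versus unrated comparison explicit. A minimal sketch on made-up numbers, with `toy_df` as a hypothetical stand-in for `games_df`:

```python
import pandas as pd

# toy_df is a hypothetical stand-in for games_df (values are made up)
toy_df = pd.DataFrame({
    "rated": [True, True, True, False, False, False],
    "rating_difference": [-50, 0, 50, -300, 0, 300],
})

# a mean near zero means neither color is systematically higher rated;
# a larger standard deviation means a wider, flatter histogram
spread_by_rated = toy_df.groupby("rated")["rating_difference"].agg(["mean", "std", "count"])
print(spread_by_rated)
```

In this toy example the unrated group has the larger standard deviation, which is the pattern a flatter histogram would produce.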

In [11]:
st.header('View winner color frequency by opening and skill level')

# generate a list of the 20 most popular openings
top_openings = games_df['opening'].value_counts()[:20]
openings = list(top_openings.index.unique())

# create an enhanced list with a 'Select all' option
openings_2 = openings[:]
openings_2.append('Select all')

# allow the user to select an opening, defaults on 'Select all'
opening = st.selectbox("Select an opening", openings_2, openings_2.index('Select all'))

# the filter below expects a list: pass the whole list for 'Select all', otherwise wrap the single choice
if opening == 'Select all':
    opening = openings
else:
    opening = [opening]

# get a list of the skill levels available to choose from
skill_levels = list(games_df['skill_level'].unique())

# create an enhanced list with a 'Select all' option
skill_levels_2 = skill_levels[:]
skill_levels_2.append('Select all')

# allow the user to select a skill level, defaults on 'Select all'
skill_level = st.selectbox("Select a skill level", skill_levels_2, skill_levels_2.index('Select all'))

# the filter below expects a list: pass the whole list for 'Select all', otherwise wrap the single choice
if skill_level == 'Select all':
    skill_level = skill_levels
else:
    skill_level = [skill_level]

# filter our data per the user's choices
q_string = 'opening in @opening and skill_level in @skill_level'
filter_df = games_df.query(q_string)

# generate the histogram of winner color, in a consistent order with plotly express
fig = px.histogram(filter_df, x='winner', category_orders={'winner':['white','black','draw']})

# display the figure with streamlit
st.write(fig)
fig
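
The "Select all" handling is repeated for both select boxes, so it could be factored into a small helper. The sketch below is one possible shape for it: `resolve_choice` is a hypothetical helper, `toy_df` is a made-up stand-in for `games_df`, and plain strings stand in for the values `st.selectbox` would return.

```python
import pandas as pd

def resolve_choice(choice, options, all_label="Select all"):
    """Return the list of values to filter on (hypothetical helper)."""
    # the whole option list for the catch-all label, otherwise a one-item list
    return list(options) if choice == all_label else [choice]

# toy_df is a hypothetical stand-in for games_df (values are made up)
toy_df = pd.DataFrame({
    "opening": ["Sicilian Defense", "French Defense", "Sicilian Defense", "Ruy Lopez"],
    "skill_level": ["beginner", "master", "master", "beginner"],
    "winner": ["white", "black", "black", "draw"],
})

openings = ["Sicilian Defense", "French Defense"]
skill_levels = ["beginner", "master"]

# these two strings stand in for the values returned by st.selectbox
chosen_openings = resolve_choice("Sicilian Defense", openings)
chosen_skills = resolve_choice("Select all", skill_levels)

# isin always receives a list, whichever option was picked
mask = toy_df["opening"].isin(chosen_openings) & toy_df["skill_level"].isin(chosen_skills)
filtered = toy_df[mask]
print(filtered)
```

Because the helper always returns a list, the filter behaves the same way for a single selection and for "Select all".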

In [12]:
# generate a scatter plot of rating difference vs. average player rating
px.scatter(games_df, x='avg_rating', y='rating_difference')

### Conclusions

We read the CSV file and found that it was already fairly well prepared: there were no missing values, and the duplicates were easy to find and eliminate.

We created some new columns relevant to our exploratory analysis. We attempted to identify trends by plotting our variables in appropriate ways and did find some interesting results:

* As player rating increases, fewer games end within the first few turns.

* Above an average rating of roughly 2000, the scatter also tapers at the top: games with unusually high turn counts become less common.

* As skill level increases, the ratio of resignations to checkmates also increases.

* The rating difference between players is roughly bell-shaped and centered near zero. The distribution for unrated games is slightly flatter than the overall distribution.

* Some openings have very different win rates depending on skill level. For example, the Scandinavian Defense is very successful for black in the higher skill brackets, but lower-rated players should probably try a different approach.
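
Two of the findings above were read off the plots and could be checked numerically. The sketch below shows one way to do that with `pd.crosstab`; `toy_df` is a hypothetical stand-in for `games_df`, and its values are made up purely to illustrate the calculation, not to reproduce the results.

```python
import pandas as pd

# toy_df is a hypothetical stand-in for games_df (values are made up)
toy_df = pd.DataFrame({
    "skill_level": ["beginner"] * 4 + ["master"] * 4,
    "victory_status": ["mate", "mate", "resign", "resign",
                       "resign", "resign", "resign", "mate"],
    "opening": ["Scandinavian Defense"] * 8,
    "winner": ["white", "white", "white", "black",
               "black", "black", "white", "black"],
})

# games per end condition within each skill level
status_counts = pd.crosstab(toy_df["skill_level"], toy_df["victory_status"])

# resignations per checkmate; higher values mean more games end by resignation
resign_to_mate = status_counts["resign"] / status_counts["mate"]
print(resign_to_mate)

# share of games won by each color, per opening and skill level
winner_share = pd.crosstab(
    [toy_df["opening"], toy_df["skill_level"]],
    toy_df["winner"],
    normalize="index",
)
print(winner_share)
```

Run against `games_df`, `resign_to_mate` would show whether the ratio really rises with skill level, and `winner_share` would give the win rates behind the per-opening comparison.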

