# COGS 108 - Data Checkpoint

# Names

- Crystal Zhan
- Akil Selvan Rajendra Janarthanan 
- Kristen Prescaro
- Kristine Thipatima
- Ethan Dinh-Luong

<a id='research_question'></a>
# Research Question

How did the addition of the Fairy type in Pokemon affect the usage of Dragon types in battles on Pokemon Showdown? Additionally, how does this effect vary across battle formats, player rating levels, related types, and generations?

# Dataset(s)

- Dataset Name: Pokedex
- Link to the dataset: https://github.com/smogon/pokemon-showdown/blob/master/data/pokedex.ts
- Number of observations: 1155

This dataset lists every Pokemon along with many of its attributes, such as name, gender, height, and stats. We use only the name and type.

- Dataset Name: Moves
- Link to the dataset: https://github.com/smogon/pokemon-showdown/blob/master/data/moves.ts
- Number of observations: 859 

This dataset lists every move a Pokemon can learn along with its attributes, such as power, accuracy, and type. We use only the move's name and type.

- Dataset Name: Pokemon Showdown Battle Stats 
- Link to the dataset: https://www.smogon.com/stats/
- Number of observations: varies by month, format, and rating cutoff (for example, the Gen 8 OU file cleaned below contains 440 Pokemon before filtering)

This dataset contains the statistics from Pokemon Showdown battles from 2014 to the present, split by battle format. We will extract the most-used Pokemon in specific formats and months at high rating, along with each Pokemon's usage % and move usage %.

To combine these datasets, we will be using the Pokemon's name. 
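
As a rough illustration of the join described above, the sketch below merges a small usage table with a small type table on the Pokemon's name. The column names (`Pokemon`, `Types`, `Is Dragon`) and the two hand-built tables are assumptions made for this example; the real Pokedex file is TypeScript and would need to be parsed into a table first.

```python
import pandas as pd

# Hand-built stand-ins for the real tables (column names are assumptions).
usage = pd.DataFrame({
    "Pokemon": ["Landorus-Therian", "Blissey"],
    "Usage": [0.304108, 0.084829],
})
pokedex = pd.DataFrame({
    "Pokemon": ["Landorus-Therian", "Blissey", "Kingdra"],
    "Types": [["Ground", "Flying"], ["Normal"], ["Water", "Dragon"]],
})

# A left join keeps every usage row; validate guards against duplicate Pokedex names.
merged = usage.merge(pokedex, on="Pokemon", how="left", validate="many_to_one")

# Flag Dragon types, which is the attribute the research question cares about.
merged["Is Dragon"] = merged["Types"].apply(lambda types: "Dragon" in types)
print(merged)
```

A left join is used here so that a Pokemon missing from the type table shows up as a NaN to investigate rather than silently disappearing.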


# Setup

In [1]:
import pandas as pd
import numpy as np
import seaborn as sns
import matplotlib.pyplot as plt

# Data Cleaning


## Pokemon Showdown Battle Stats

Pokemon Showdown publishes its battle statistics as a set of semi-structured JSON files, which must be cleaned before they can be read into a usable tabular format.

*The cleaning process shown below for one file was repeated for every other JSON file.*

We downloaded the file and loaded it into the notebook, then dropped the rows whose `data` value is NaN. These rows hold file-level metadata rather than per-Pokemon statistics, so they are not needed at this step.

In [8]:
raw = pd.read_json("Pokemon Usage/September/raw/gen8/gen8ou-0.json")
df = raw[raw['data'].notna()]['data']
df

Mr. Mime-Galar    {'Moves': {'': 32.0, 'healingwish': 226.0, 'bl...
Eevee             {'Moves': {'': 197.0, 'rest': 7.0, 'mudslap': ...
Torracat          {'Moves': {'': 1.0, 'firespin': 20.0, 'leechli...
Poliwrath         {'Moves': {'': 58.0, 'counter': 48.0, 'liquida...
Emolga            {'Moves': {'': 2.0, 'eerieimpulse': 47.0, 'ris...
                                        ...                        
Shedinja          {'Moves': {'': 578.0, 'absorb': 11.0, 'falsesw...
Wishiwashi        {'Moves': {'': 67.0, 'liquidation': 393.0, 'be...
Sneasel           {'Moves': {'counter': 3.0, 'beatup': 9.0, 'bli...
Hitmontop         {'Moves': {'': 208.0, 'detect': 89.0, 'quickgu...
Kingdra           {'Moves': {'': 57.0, 'icywind': 32.0, 'liquida...
Name: data, Length: 440, dtype: object
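
To see why filtering on NaN isolates the per-Pokemon entries, the sketch below loads a miniature stand-in for one stats file. The `info` key, the `cutoff` field, and the usage values are assumptions made for illustration; only the general two-level shape is taken from the output above.

```python
import io
import pandas as pd

# Miniature stand-in for one stats file (keys and values are illustrative assumptions).
text = """
{
  "info": {"metagame": "gen8ou", "cutoff": 0},
  "data": {
    "Eevee":   {"usage": 0.01, "Moves": {"rest": 7.0}},
    "Kingdra": {"usage": 0.05, "Moves": {"icywind": 32.0}}
  }
}
"""
raw = pd.read_json(io.StringIO(text))

# Each top-level key becomes a column, and the row index is the union of the
# nested keys, so the metadata rows are NaN in the "data" column.
print(raw["data"].isna())

# Keeping the non-NaN entries leaves one dictionary per Pokemon.
df = raw["data"].dropna()
print(list(df.index))
```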

In the dataset, each Pokemon is described by the following variables:

In [11]:
df.iloc[0].keys()

dict_keys(['Moves', 'Checks and Counters', 'Abilities', 'Teammates', 'usage', 'Items', 'Raw count', 'Spreads', 'Happiness', 'Viability Ceiling'])

To narrow the data down to what our analysis needs, we kept only entries that meet the following criteria:
- Pokemon with at least 2% usage
- Each Pokemon's Top 6 Moves

In [16]:
### Dictionary to make the DataFrame
top_mons = {}

### For each Pokemon (index label) and its dictionary of statistics
for mon, stats in df.items():

    ### At least 2% Usage
    if stats['usage'] >= .02:

        ### Finds the Top 6 Moves, sorted by usage count (highest first)
        moves = stats['Moves']
        top_6 = sorted(moves, key=moves.get, reverse=True)[:6]

        ### Saves info to dictionary
        top_mons[mon] = [top_6, stats['usage']]

### Output DataFrame
cleaned = pd.DataFrame.from_dict(top_mons, orient='index', columns=["Moves", "Usage"])
cleaned

|  | Moves | Usage |
|---|---|---|
| Landorus-Therian | [earthquake, uturn, stealthrock, knockoff, tox... | 0.304108 |
| Blissey | [softboiled, seismictoss, toxic, teleport, thu... | 0.084829 |
| Slowbro | [scald, teleport, slackoff, futuresight, icebe... | 0.057747 |
| Crawdaunt | [aquajet, knockoff, crabhammer, swordsdance, c... | 0.028303 |
| Urshifu-Rapid-Strike | [surgingstrikes, closecombat, aquajet, uturn, ... | 0.129478 |
| ... | ... | ... |
| Arctozolt | [boltbeak, lowkick, blizzard, substitute, free... | 0.026643 |
| Melmetal | [doubleironbash, thunderpunch, earthquake, ice... | 0.092703 |
| Mew | [taunt, stealthrock, spikes, icebeam, roost, s... | 0.060138 |
| Hippowdon | [earthquake, slackoff, stealthrock, toxic, whi... | 0.042812 |
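
Because the Moves dataset is organized by move name, one possible next step is to reshape the list-valued **Moves** column into one row per Pokemon and move, then look up each move's type. This is only a sketch under assumptions: the `move_type` dictionary is a hand-built stand-in for the parsed Moves dataset, and the column names `Pokemon`, `Move`, and `Move Type` are hypothetical.

```python
import pandas as pd

# Two rows shaped like the cleaned table above, truncated to two moves each.
cleaned = pd.DataFrame(
    {
        "Moves": [["earthquake", "uturn"], ["softboiled", "toxic"]],
        "Usage": [0.304108, 0.084829],
    },
    index=["Landorus-Therian", "Blissey"],
)

# Hand-built stand-in for the Moves dataset: move name -> move type.
move_type = {
    "earthquake": "Ground",
    "uturn": "Bug",
    "softboiled": "Normal",
    "toxic": "Poison",
}

# One row per (Pokemon, move) pair instead of one list per Pokemon.
long = (
    cleaned.rename_axis("Pokemon")
    .reset_index()
    .explode("Moves", ignore_index=True)
    .rename(columns={"Moves": "Move"})
)

# Attach each move's type by name.
long["Move Type"] = long["Move"].map(move_type)
print(long)
```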


Additionally, each dataframe includes 3 more columns identifying which JSON file the data originated from: **Gen**, **Format**, and **Min Rating**. These values come from the metadata rows of the JSON file.

In [18]:
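### Metadata rows share the index with the Pokemon rows; take the first column by position
metagame = raw.loc['metagame'].iloc[0]
### Single-digit generation, e.g. 'gen8ou' -> '8'
gen = metagame[3]
format_name = metagame[4:]
### Minimum rating is the 'cutoff' row ('cutoff deviation' is a separate field)
rating = int(raw.loc['cutoff'].iloc[0])
cleaned["Gen"] = gen
cleaned["Format"] = format_name
cleaned["Min Rating"] = rating
cleaned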

|  | Moves | Usage | Gen | Format | Min Rating |
|---|---|---|---|---|---|
| Landorus-Therian | [earthquake, uturn, stealthrock, knockoff, tox... | 0.304108 | 8 | ou | 0 |
| Blissey | [softboiled, seismictoss, toxic, teleport, thu... | 0.084829 | 8 | ou | 0 |
| Slowbro | [scald, teleport, slackoff, futuresight, icebe... | 0.057747 | 8 | ou | 0 |
| Crawdaunt | [aquajet, knockoff, crabhammer, swordsdance, c... | 0.028303 | 8 | ou | 0 |
| Urshifu-Rapid-Strike | [surgingstrikes, closecombat, aquajet, uturn, ... | 0.129478 | 8 | ou | 0 |
| ... | ... | ... | ... | ... | ... |
| Arctozolt | [boltbeak, lowkick, blizzard, substitute, free... | 0.026643 | 8 | ou | 0 |
| Melmetal | [doubleironbash, thunderpunch, earthquake, ice... | 0.092703 | 8 | ou | 0 |
| Mew | [taunt, stealthrock, spikes, icebeam, roost, s... | 0.060138 | 8 | ou | 0 |
| Hippowdon | [earthquake, slackoff, stealthrock, toxic, whi... | 0.042812 | 8 | ou | 0 |
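
Since the same steps are repeated for every JSON file, they could be wrapped in a function and the per-file results stacked into one table. The following is a minimal sketch, not the notebook's actual code: `clean_usage`, `fake_file`, the `info` column name, the `cutoff` row, and the in-memory sample values are all assumptions made for illustration.

```python
import io
import json
import pandas as pd

def clean_usage(raw, min_usage=0.02, top_n=6):
    """Hypothetical helper: apply the cleaning steps above to one loaded file."""
    rows = {}
    for mon, stats in raw["data"].dropna().items():
        if stats["usage"] >= min_usage:
            moves = stats["Moves"]
            top_moves = sorted(moves, key=moves.get, reverse=True)[:top_n]
            rows[mon] = [top_moves, stats["usage"]]
    cleaned = pd.DataFrame.from_dict(rows, orient="index", columns=["Moves", "Usage"])

    # File-level metadata (assumes it lives in an "info" column).
    info = raw["info"].dropna()
    cleaned["Gen"] = info["metagame"][3]       # single-digit generations only
    cleaned["Format"] = info["metagame"][4:]
    cleaned["Min Rating"] = info["cutoff"]
    return cleaned

def fake_file(metagame, cutoff, data):
    """Build an in-memory stand-in for one downloaded stats file."""
    payload = {"info": {"metagame": metagame, "cutoff": cutoff}, "data": data}
    return io.StringIO(json.dumps(payload))

# Two made-up files: same format, different rating cutoffs.
files = [
    fake_file("gen8ou", 0, {
        "Eevee": {"usage": 0.30, "Moves": {"rest": 7.0, "mudslap": 9.0}},
        "Kingdra": {"usage": 0.01, "Moves": {"icywind": 32.0}},  # below 2%, dropped
    }),
    fake_file("gen8ou", 1825, {
        "Kingdra": {"usage": 0.10, "Moves": {"icywind": 32.0, "liquidation": 5.0}},
    }),
]

# Clean each file the same way, then stack the results.
combined = pd.concat([clean_usage(pd.read_json(f)) for f in files])
print(combined)
```

With real data, the list of in-memory objects would be replaced by the paths of the downloaded files.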
