# Welcome

With our fundamentals covered, we can now turn to analyzing some data. You'll notice that the `data` folder contains [three files](https://github.com/hermish/hkn-workshops/tree/master/CS/SP19-data-workshop/data). The first two are described below.

1. The first one contains the Pokemon characteristics (the first column is the id of the Pokemon).
2. The second one contains information about previous combats. The first two columns contain the ids of the combatants, and the third contains the id of the winner. Important: the Pokemon in the first column attacks first.

Open [these files](https://github.com/hermish/hkn-workshops/tree/master/CS/SP19-data-workshop/data) to get a feel for what the raw data looks like: our data is in a `csv` format which stands for comma-separated values. Basically, our data comes in a table, with entries separated by commas.
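
To make the format concrete, here is a minimal sketch that parses a tiny CSV snippet with Python's built-in `csv` module. The three rows are made-up placeholders, not values from the workshop files; only the column names `#`, `Name` and `Attack` are borrowed from the dataset used later.

```python
import csv
import io

# Made-up sample text in CSV form (NOT taken from the workshop data files).
raw_text = """#,Name,Attack
1,ExampleMon A,40
2,ExampleMon B,65
3,ExampleMon C,90
"""

# DictReader uses the first line as the header and yields one dict per row.
rows = list(csv.DictReader(io.StringIO(raw_text)))

for row in rows:
    # Every value is read as a string, so numbers must be converted explicitly.
    print(row["#"], row["Name"], int(row["Attack"]))
```

Pandas' `read_csv`, used in the next section, handles both the parsing and the type conversion for us.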

# Introduction: Ash Ketchum

Ash Ketchum once stated: "I want to be the very best, like no one ever was." With those words, Ash set out on his quest to become a Pokemon master. Oddly enough, he *still* has not won a single Pokemon League Championship, despite competing in 6. Today, we're going to explore the Pokemon dataset in order to hone our data science techniques and become Pokemon masters!

<img src="ash.jpg">

## Reading in Data

To be able to do anything with this data, we first need to read it into Python. Luckily, this is such a common task that there are already many tools for reading, working with and visualizing data. We'll be using a library called `Pandas` for data analysis. `Pandas` is currently one of the most popular data analysis libraries, and is used in many UC Berkeley Data Science courses. Below, we import Pandas and some other libraries that we will need.

In [None]:
import numpy as np # linear algebra
import pandas as pd # data processing
import matplotlib.pyplot as plt # plotting
import seaborn as sns # aesthetics
%config InlineBackend.figure_format = 'retina'

In [None]:
# Read pokemon statistics and print first 10 rows out of 800
data = pd.read_csv('data/pokemon.csv')
print(len(data))
data.head(10)

## Visualizing Data: Matplotlib

The first step to strong data analysis is being able to visualize the data provided. Matplotlib is a Python library that helps us plot data: the most basic plots are line, scatter and histogram plots.

- Line plots are usually used when the x-axis is time
- Scatterplots are better when we want to check if there is correlation between two variables
- Histograms visualize the distribution of numerical data


Matplotlib allows us to customize the colors, labels, thickness of lines, title, opacity, grid, figure size... Basically, we can make any graph conceivable, though some are easier than others.

In [None]:
# LINE PLOTS
data['Speed'].plot(
    kind='line', # Type of plot
    color='blue', # Line color
    label='Speed', # Legend name
    linewidth=1,
    alpha = 0.5, # Transparency
    grid = True
)
data['Defense'].plot(
    color='green',
    label='Defense',
    linewidth=1,
    alpha = 0.5,
    grid = True
)

plt.legend(loc='upper right') # legend locations
plt.xlabel('Pokemon Number') # x-axis label
plt.ylabel('Stat') # y-axis label
plt.title('Speed and Defense across Pokemon') # Title
plt.show()

In [None]:
# SCATTER PLOT
data.plot(
    kind='scatter', # Type of plot
    x='Attack', # x-axis column
    y='Defense', # y-axis column
    alpha = 0.5, # Transparency
    color = 'red'
)
plt.xlabel('Attack') # x-axis label
plt.ylabel('Defense') # y-axis label
plt.title('Attack-Defense Relationship') # Title
plt.show()

In [None]:
# HISTOGRAM
data['Speed'].plot( # Choosing what to plot
    kind = 'hist', # Type of plot
    bins = 50,
    figsize = (15, 15) # Size of graph
)
plt.show()

## Manipulating Data: Pandas

This list of Pandas functions will come in handy: https://pandas.pydata.org/Pandas_Cheat_Sheet.pdf

The table from which we are getting the data is called a DataFrame. Each of the columns can be thought of as a list or series of data, which can be accessed in much the same way we access elements in a list.

### Grabbing Data
When manipulating our data, it can be helpful to grab specific rows and columns. Here, we use the `head` and `tail` functions to grab the Defense stats of the first 5 Pokemon in our dataset and the Attack stats of the last 5 Pokemon in our dataset.

In [None]:
print(data['Defense'].head(5))
print(data['Attack'].tail(5))

We can also access a column as an attribute using dot notation. Let's say that we only want to see each Pokemon's name alongside its row number in our dataset. We can simply use `data.Name` to do this!
**Note: Column names are case sensitive!**

In [None]:
names = data.Name
names.head(13)

We can also access rows and columns using the `iloc` and `loc` indexers. Let's say that we want to grab rows 20 to 23 in our dataset. The `tail` and `head` functions can't do this, but `iloc` can! Note that the end of an `iloc` slice is excluded, so we write `20:24`.

In [None]:
middle_rows = data.iloc[20:24]
middle_rows

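Now, let's say we only care about a Pokemon's Types, Attack, and Defense. We can grab a range of columns with the `loc` indexer. Unlike `iloc`, a `loc` slice uses labels and includes both endpoints, so `'#' : 'Defense'` returns every column from `#` through `Defense`, including the ones we want along the way. Notice that to grab columns, we add a `:,` before we specify which columns we want!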

In [None]:
columns = data.loc[:, '#' : 'Defense']
columns.head(5)
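
The difference between the two indexers is easy to miss: `iloc` selects by integer position and excludes the end of a slice, while `loc` selects by label and includes it. A small sketch on a toy DataFrame (the values and the name `toy` are made up for illustration):

```python
import pandas as pd

# Toy table with made-up values; column names mirror the Pokemon dataset.
toy = pd.DataFrame({
    "#": [1, 2, 3, 4, 5],
    "Name": ["A", "B", "C", "D", "E"],
    "Attack": [10, 20, 30, 40, 50],
    "Defense": [15, 25, 35, 45, 55],
})

# iloc is position-based: rows at positions 1 and 2 (position 3 is excluded).
by_position = toy.iloc[1:3]

# loc is label-based: columns from "Name" through "Attack", both included.
by_label = toy.loc[:, "Name":"Attack"]

# loc also accepts an explicit list when the columns are not adjacent.
picked = toy.loc[:, ["Name", "Defense"]]

print(by_position)
print(by_label)
print(picked)
```

Passing a list of column names is often the clearer choice when you want a handful of specific columns rather than a contiguous range.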

Another helpful function is the `drop` function. It's pretty self-explanatory, so you can read the documentation yourself on the Pandas cheat sheet: https://pandas.pydata.org/Pandas_Cheat_Sheet.pdf
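
As a quick illustration of `drop`, the sketch below removes a column and then a row from a toy DataFrame (made-up values; `toy` is a hypothetical name). `drop` returns a new DataFrame and leaves the original untouched unless you reassign it.

```python
import pandas as pd

# Toy table with made-up values.
toy = pd.DataFrame({
    "Name": ["A", "B", "C"],
    "Attack": [10, 20, 30],
    "Defense": [15, 25, 35],
})

# Drop a column by name.
without_defense = toy.drop(columns=["Defense"])

# Drop a row by its index label (here the default integer labels 0, 1, 2).
without_first_row = toy.drop(index=[0])

print(without_defense)
print(without_first_row)
```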

### Filtering and Concatenating Data

We can also filter data easily. Let's say we want to show all the pokemon with more than 200 defense. We do this by checking if the defense of each row is greater than 200 and then only asking for the rows for which this is true. It turns out that there are only 3 pokemon like this! Change the condition to see what else you can discover.

In [None]:
strong_defenders = data['Defense'] > 200
data[strong_defenders]

What if we want to find all the Pokemon with more than 175 attack? This is easy too!

In [None]:
strong_attackers = data['Attack'] > 175
data[strong_attackers]

If we want to build a super strong Attack/Defense Pokemon team, we can use `pd.concat` to add these two sets of data to form a new dataset!

In [None]:
strong_pokemon = pd.concat([data[strong_defenders], data[strong_attackers]])
strong_pokemon
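
One caveat: a Pokemon that satisfies both conditions would appear twice after `pd.concat`. A possible alternative, sketched here on made-up values, is to combine the two boolean masks with `|` (element-wise "or"), which selects each row at most once. Calling `drop_duplicates` after concatenating is another option.

```python
import pandas as pd

# Toy table with made-up values; row "A" passes both filters.
toy = pd.DataFrame({
    "Name": ["A", "B", "C", "D"],
    "Attack": [180, 50, 190, 60],
    "Defense": [210, 220, 40, 70],
})

strong_defenders = toy["Defense"] > 200
strong_attackers = toy["Attack"] > 175

# Concatenation repeats row "A", which appears in both filtered tables.
stacked = pd.concat([toy[strong_defenders], toy[strong_attackers]])

# Option 1: combine the masks so each row is selected once.
either = toy[strong_defenders | strong_attackers]

# Option 2: remove the repeated rows after concatenating.
deduplicated = stacked.drop_duplicates()

print(list(stacked["Name"]))
print(list(either["Name"]))
print(list(deduplicated["Name"]))
```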

### Assignment: Build Ash's Team

Now, we have enough tools to build Ash's team! Ash's Pokemon are the following:
-  Pikachu
-  Squirtle
-  Bulbasaur
-  Tauros
-  Lapras
-  Charizard

Use your data analysis skills to grab the rows containing these 6 Pokemon and add them to a new dataframe!

*Hint: Use filtering and concatenation*

In [None]:
pikachu = data[data['Name'] == 'Pikachu']
squirtle = data[data['Name'] == 'Squirtle']
bulbasaur = data[data['Name'] == 'Bulbasaur']
tauros = data[data['Name'] == 'Tauros']
lapras = data[data['Name'] == 'Lapras']
charizard = data[data['Name'] == 'Charizard']

ash_team = pd.concat([pikachu, squirtle, bulbasaur, tauros, lapras, charizard])
ash_team

# entries = data['Name'].isin(['Pikachu', 'Squirtle', 'Bulbasaur', 'Tauros', 'Lapras', 'Charizard'])
# ash_team = data[entries]
# ash_team

### Sorting Data

We successfully built Ash's team! However, it seems like the Pokemon are a bit out of order. Let's fix that with the DataFrame's `sort_values()` method.

In [None]:
#First argument of sort_values is the column by which we are sorting
#Second argument determines whether we want to sort from lowest to highest or highest to lowest.
#Set ascending to True if we want to sort from lowest to highest

ash_team = ash_team.sort_values('#', ascending = True)
ash_team
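
`sort_values` can also sort in descending order or by several columns at once. A small sketch on made-up values (the name `toy` is hypothetical):

```python
import pandas as pd

# Toy table with made-up values.
toy = pd.DataFrame({
    "Name": ["A", "B", "C", "D"],
    "Attack": [50, 80, 70, 90],
    "Legendary": [False, True, False, True],
})

# Highest Attack first.
by_attack = toy.sort_values("Attack", ascending=False)

# Legendary rows first, then highest Attack first within each group.
by_group_then_attack = toy.sort_values(
    ["Legendary", "Attack"], ascending=[False, False]
)

print(by_attack)
print(by_group_then_attack)
```

Like `drop`, `sort_values` returns a new DataFrame, which is why the result is assigned back to a variable.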

## Data Analysis: Pandas

Now we can analyze the data on the Pokemon. The easiest way to get an idea of how the data looks is to plot some of it and compute a few summary statistics. Luckily for us, it is really easy to get the following from our table:

- count: number of entries
- mean: the average value over all Pokemon
- std: standard deviation, measures how spread out the data is
- min: minimum entry
- 25%: first quartile (25% of entries are below)
- 50%: median or second quartile (50% of entries are below)
- 75%: third quartile (75% of entries are below)
- max: maximum entry

In [None]:
data.describe()

From this overview, we can easily see the minimum and maximum for each of these statistics. To visualize this, we often use a boxplot, which allows us to easily picture the center, spread and outliers. The box spans the 25th to the 75th percentile, and the line inside it marks the median (not the mean). The bars at the top and bottom (the whiskers) extend to the most extreme data points that lie within 1.5 times the box height of the box; any points beyond them are drawn individually as outliers.

In [None]:
data.boxplot(
    column='Attack'
)
plt.show()
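
Boxplots are built from quartiles. To connect the picture back to the numbers, this sketch computes the quartiles by hand with Python's `statistics` module and flags outliers. It assumes Matplotlib's default convention of treating points more than 1.5 times the interquartile range beyond the box as outliers; the values are made up.

```python
import statistics

# Made-up stats; 200 is deliberately far from the rest.
values = [10, 20, 30, 40, 50, 60, 70, 80, 200]

# method="inclusive" interpolates linearly between data points.
q1, median, q3 = statistics.quantiles(values, n=4, method="inclusive")
iqr = q3 - q1  # interquartile range: the height of the box

# Points outside these limits are flagged as outliers.
lower_limit = q1 - 1.5 * iqr
upper_limit = q3 + 1.5 * iqr
outliers = [v for v in values if v < lower_limit or v > upper_limit]

print("quartiles:", q1, median, q3)
print("mean:", statistics.mean(values))
print("outliers:", outliers)
```

Notice that the single large value pulls the mean well above the median, which is one reason boxplots are drawn around quartiles.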

Notice that some of our Pokemon are legendary while others are not. We might guess that legendary Pokemon have higher attack than non-legendary Pokemon. We can test this hypothesis by creating a separate boxplot of attack for each group, which is easy to do with the `by` keyword.

In [None]:
data.boxplot(
    column='Attack', # Column plotting
    by='Legendary' # Group
)
plt.show()
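
The same comparison can be made numerically. As a sketch on made-up values (the real dataset will give different numbers), `groupby` splits the rows by the `Legendary` flag and summarizes `Attack` within each group:

```python
import pandas as pd

# Toy table with made-up values; column names mirror the Pokemon dataset.
toy = pd.DataFrame({
    "Name": ["A", "B", "C", "D", "E"],
    "Attack": [50, 60, 70, 110, 130],
    "Legendary": [False, False, False, True, True],
})

# Mean and median Attack for each value of Legendary.
summary = toy.groupby("Legendary")["Attack"].agg(["mean", "median"])
print(summary)
```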

Now, let's compare Ash's team statistics to the general Pokemon dataset statistics.

In [None]:
ash_team.describe()

In [None]:
data.describe()

From the describe methods, we see that Ash's pokemon have very average stats. If Ash had known this, maybe he would have trained and evolved them so that he could be more competitive in the Pokemon league!

# Extra: Heat Maps

Finally, there are many more interesting figures we can make using Python. For example, when we plotted attack against defense, the two seemed to be linked or *correlated*.

In general, we can measure how strongly two variables are correlated using the "Pearson product-moment correlation coefficient." This number will always be between -1 and +1. Values near $\pm1$ mean the linear correlation is strong, while values closer to zero indicate that the variables are less strongly linked.
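
For reference, the coefficient for two variables is their covariance divided by the product of their standard deviations. A minimal sketch that computes it by hand on made-up numbers (the helper name `pearson` is ours, not part of any library):

```python
import math

def pearson(xs, ys):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    # Sum of products of deviations (proportional to the covariance).
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    # Sums of squared deviations (proportional to the variances).
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / math.sqrt(var_x * var_y)

attack = [10, 20, 30, 40, 50]    # made-up values
defense = [12, 25, 31, 38, 55]   # made-up values that rise with attack

r_up = pearson(attack, defense)                  # close to +1
r_down = pearson(attack, [50, 40, 30, 20, 10])   # perfectly decreasing

print(round(r_up, 3), round(r_down, 3))
```

The `corr` method of a DataFrame computes this same quantity by default for every pair of columns.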

We can visualize how all the variables are correlated by plotting these values in a heat map. The more strongly correlated values are more strongly colored.

In [None]:
f, ax = plt.subplots(figsize=(6, 6))
sns.heatmap(
    data.select_dtypes(include=['number', 'bool']).corr(), # Correlate numeric and boolean columns only
    annot=True,
    linewidths=.5,
    fmt= '.1f',
    ax=ax
)
plt.show()

## Assignment: Build your own Pokemon Team
Now, let's put your skills to the test: build your ultimate Pokemon team. Use your data analysis skills to find 6 Pokemon with the best combination of attack, defense, speed, etc. to defeat your opponents. A good team should have the following qualities:
-  A wide variety of Pokemon types, so your team won't have any weaknesses
-  High stat totals, so you can outclass your opponents

Create a new dataframe that contains only your 6 Pokemon using the Pokemon dataset that is provided. Then, plot some graphs that show off your team's statistics.
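
As a starting hint rather than a solution, one way to compare overall strength is to add up several stat columns into a total and rank by it. The sketch below does this on made-up values; which columns to include is your choice, and `Total` is a hypothetical column name added for illustration.

```python
import pandas as pd

# Toy table with made-up values; column names mirror the Pokemon dataset.
toy = pd.DataFrame({
    "Name": ["A", "B", "C", "D"],
    "Attack": [50, 90, 70, 100],
    "Defense": [60, 40, 80, 95],
    "Speed": [70, 100, 30, 90],
})

stat_columns = ["Attack", "Defense", "Speed"]

# Add the chosen stats row by row (axis=1) to get one total per Pokemon.
toy["Total"] = toy[stat_columns].sum(axis=1)

# Keep the two rows with the largest totals.
top_two = toy.nlargest(2, "Total")
print(top_two)
```

A high total alone does not guarantee type variety, so you will still want to look at the types of the Pokemon you pick.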

In [None]:
#Build the best 6 Pokemon team you possibly can.

### YOUR CODE HERE ###

#Plot some graphs that show off your team's statistics.

### YOUR CODE HERE ###

## Challenge Assignment: Sahai, the Shapeshifting Pokemon
UC Berkeley Pokemon trainers have been beating Stanford Pokemon trainers for decades. Recently, however, Stanford has acquired some powerful Pokemon that are difficult to beat. Thus, Berkeley Labs are trying to synthesize a new, ultimate Pokemon that can shapeshift its stats to beat any opponent.

The shapeshifting Pokemon, named "Sahai", has three shape-shifting forms: Attack, Defense, and Speed. These shapeshifting forms are a set of statistics that are "stolen" from another Pokemon. Each form has the following criteria:
-  The stat line is "stolen" from another Pokemon (Ex. Attack form may be Pikachu's exact statistics)
-  The stat line for Attack form must have attack in the top 10 of all the Pokemon in the dataset, the stat line for Defense form must have defense in the top 10 of all the Pokemon in the dataset, etc.
-  No stat line can have any statistic that is below average

Create a new dataframe that contains the three forms that Sahai can shapeshift into. Then, write a function that takes in the stats of the Pokemon that it is battling against, and returns the shapeshifting form that it will take on to win the battle.

In [None]:
#Build the Sahai shapeshifting dataframe

### YOUR CODE HERE ###

#Write the shapeshifting function

def shapeshift(opponent_stats):
    ### YOUR CODE HERE ###
    pass # Placeholder so the cell runs; replace with your code