# Challenge 1

In this challenge you will be working on **Pokemon**. You will answer a series of questions in order to practice dataframe calculation, aggregation, and transformation.

![Pokemon](../images/pokemon.jpg)

Follow the instructions below and enter your code.

#### Import all required libraries.

In [None]:
# import libraries
import pandas as pd
import numpy as np
import sqlalchemy
import re

#### Import data set.

Import data set `Pokemon` from Ironhack's database. Read the data into a dataframe called `pokemon`.

*Data set attributed to [Alberto Barradas](https://www.kaggle.com/abcsds/pokemon/)*

In [None]:
pokemon = pd.read_csv('pokemon_stats.csv', index_col=0)

#### Print first 10 rows of `pokemon`.

In [None]:
# your code here
pokemon.head(10)

When you look at a data set, you often wonder what each column means. Some open-source data sets provide descriptions of the data set. In many cases, data descriptions are extremely useful for data analysts to perform work efficiently and successfully.

For the `Pokemon.csv` data set, fortunately, the owner provided descriptions which you can see [here](https://www.kaggle.com/abcsds/pokemon/home). For your convenience, we are including the descriptions below. Read the descriptions and understand what each column means. This knowledge is helpful in your work with the data.

| Column | Description |
| --- | --- |
| # | ID for each pokemon |
| Name | Name of each pokemon |
| Type 1 | Each pokemon has a type, which determines its weakness/resistance to attacks |
| Type 2 | Some pokemon are dual type and have 2 |
| Total | A general guide to how strong a pokemon is |
| HP | Hit points, or health, defines how much damage a pokemon can withstand before fainting |
| Attack | The base modifier for normal attacks (e.g. Scratch, Punch) |
| Defense | The base damage resistance against normal attacks |
| SP Atk | Special attack, the base modifier for special attacks (e.g. fire blast, bubble beam) |
| SP Def | The base damage resistance against special attacks |
| Speed | Determines which pokemon attacks first each round |
| Generation | Number of generation |
| Legendary | True if the pokemon is Legendary, False if not |

#### Obtain the distinct values across `Type 1` and `Type 2`.

Extract all the values in `Type 1` and `Type 2`. Then create an array containing the distinct values across both fields.

**Solution:** `set()` automatically removes duplicates, since a Python set is a collection of unique elements. We take the union of the two sets (with the `|` operator) to get all the distinct values from both `Type 1` and `Type 2`. Finally, we remove the `nan` value, which appears because some pokemon do not have a `Type 2`.

In [None]:
# your code here
types_list = list(set(pokemon['Type 1'])|set(pokemon['Type 2']))

In [None]:
types_list

In [None]:
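# drop the missing value explicitly; a set has no guaranteed order, so nan is not necessarily first
types_list = [t for t in types_list if pd.notna(t)]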

In [None]:
types_list
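The set-union approach can be tried on a small example. The sketch below uses a made-up frame called `demo` (not the real data set) and filters out the missing value with `pd.notna`, so the result does not depend on where `nan` happens to sit in the set.

```python
import numpy as np
import pandas as pd

# Hypothetical toy data standing in for the real `pokemon` dataframe
demo = pd.DataFrame({
    'Type 1': ['Grass', 'Fire', 'Water', 'Fire'],
    'Type 2': ['Poison', np.nan, np.nan, 'Flying'],
})

# Union of both columns as sets, then drop missing values explicitly
all_types = set(demo['Type 1']) | set(demo['Type 2'])
distinct_types = sorted(t for t in all_types if pd.notna(t))
print(distinct_types)
```

Sorting is optional; it only makes the output order reproducible.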

#### Clean up `Name` values that contain "Mega".

If you have checked out the pokemon names carefully enough, you should have found that the names containing "Mega" include junk text. We want to clean up these names. For instance, "VenusaurMega Venusaur" should be "Mega Venusaur", and "CharizardMega Charizard X" should be "Mega Charizard X".

**Solution:** There are many ways to do this, as you can imagine. One option is to use a regular expression (regex) to match everything before the word "Mega" and remove it.

The specific pattern we will be using is `.+?(?=Mega )`: it lazily matches one or more characters up to, but not including, the text "Mega ". You can find more details about what each element does here: https://regex101.com/r/xGBeqZ/2/.

We use the `apply` method in pandas to apply a function to all the elements of the column `Name`. The function we apply is the `sub` function of the Python RegEx module `re`, which simply replaces every match of the pattern with whatever string we give it, in this case an empty one.

In [None]:
# your code here
pattern = r'.+?(?=Mega )'

pokemon['Name'] = pokemon['Name'].apply(lambda x: re.sub(pattern, '', x))
# test transformed data
pokemon.head(10)
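To see what the pattern `.+?(?=Mega )` does in isolation, here is a small sketch on a few hand-picked sample names (the list `samples` is illustrative only). Names without the text "Mega " followed by a space, such as "Meganium", should be left untouched.

```python
import re

# Lazy match of everything up to (but not including) "Mega "
pattern = r'.+?(?=Mega )'

# Illustrative sample names, not read from the data set
samples = ['VenusaurMega Venusaur', 'CharizardMega Charizard X',
           'Bulbasaur', 'Meganium']

cleaned = [re.sub(pattern, '', name) for name in samples]
print(cleaned)
```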

#### Create a new column called `A/D Ratio` whose value equals `Attack` divided by `Defense`.

For instance, if a pokemon has the Attack score 49 and Defense score 49, the corresponding `A/D Ratio` is 49/49=1.

**Solution:** pandas is built largely on NumPy, so arithmetic between two Series is vectorized: the operation is applied element-wise, pairing the values that share the same index label (here, the `Attack` and `Defense` of the same row), without an explicit loop.

In [None]:
# your code here
pokemon['A/D Ratio'] = pokemon['Attack']/pokemon['Defense']
pokemon.head()
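As a minimal sketch of this element-wise behaviour, the example below divides two small made-up Series. It also reorders one of them to show that pandas pairs values by index label rather than by position. The names `attack`, `defense`, and `defense_shuffled` are hypothetical.

```python
import pandas as pd

# Made-up values with explicit index labels
attack = pd.Series([49, 62, 82], index=['a', 'b', 'c'])
defense = pd.Series([49, 63, 83], index=['a', 'b', 'c'])

# Element-wise division: a/a, b/b, c/c
ratio = attack / defense

# Same values in a different order: pandas still pairs them by label
defense_shuffled = defense.loc[['c', 'a', 'b']]
ratio_shuffled = attack / defense_shuffled

print(ratio)
print(ratio_shuffled)
```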

#### Identify the pokemon with the highest `A/D Ratio`.

**Solution:** We can use the `idxmax()` method, which returns the index label of the row that contains the maximum value of a given column. We then access the entire row with the `.loc` indexer.

In [None]:
# your code here
pokemon['A/D Ratio'].idxmax()

In [None]:
pokemon.loc[pokemon['A/D Ratio'].idxmax()]

#### Identify the pokemon with the lowest `A/D Ratio`.

**Solution:** Same thing, just with idxmin().

In [None]:
# your code here
# idxmin() returns an index label, so look the row up with .loc (not the positional .iloc)
pokemon.loc[pokemon['A/D Ratio'].idxmin()]
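Because `idxmax()` and `idxmin()` return index labels rather than positions, they pair naturally with `.loc`. The sketch below illustrates this on a made-up frame whose index (10, 20, 30) deliberately differs from the row positions (0, 1, 2).

```python
import pandas as pd

# Hypothetical data; the index labels are not 0..n-1 on purpose
demo = pd.DataFrame(
    {'Name': ['A', 'B', 'C'], 'A/D Ratio': [1.0, 2.5, 0.4]},
    index=[10, 20, 30],
)

best_label = demo['A/D Ratio'].idxmax()   # label of the row with the max
worst_label = demo['A/D Ratio'].idxmin()  # label of the row with the min

best = demo.loc[best_label]    # label-based lookup
worst = demo.loc[worst_label]

print(best_label, best['Name'])
print(worst_label, worst['Name'])
```

With this index, `demo.iloc[best_label]` would raise an `IndexError`, since there is no row at position 20.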

#### Create a new column called `Combo Type` whose value combines `Type 1` and `Type 2`.

Rules:

* If both `Type 1` and `Type 2` have valid values, the `Combo Type` value should contain both values in the form of `<Type 1>-<Type 2>`. For example, if `Type 1` value is `Grass` and `Type 2` value is `Poison`, `Combo Type` will be `Grass-Poison`.

* If `Type 1` has a valid value but `Type 2` does not, `Combo Type` will be the same as `Type 1`. For example, if `Type 1` is `Fire` whereas `Type 2` is `NaN`, `Combo Type` will be `Fire`.

**Solution:** This is another problem where a good solution includes using the apply method with a function written specifically for our purpose. For each row, as explained above, there are two things that can happen:

1. `Type 2` is a NaN and therefore `Combo Type` is equal to `Type 1`
2. `Type 2` has a valid value and therefore `Combo Type` is equal to `Type 1-Type 2` 

We can therefore write a simple function and then apply it to each row. The function will have to return `Type 1` in the first case and `Type 1-Type 2` in the second case.

In [None]:
def combotypes(row):
    if pd.isna(row['Type 2']):
        return row['Type 1']
    else:
        return f"{row['Type 1']}-{row['Type 2']}"

We then pass this function to the `apply` method and assign the result to a new column called `Combo Type`. Note that we are using the parameter `axis=1` to tell pandas to apply the function row by row rather than column by column (the default, `axis=0`).

In [None]:
pokemon['Combo Type'] = pokemon.apply(combotypes, axis=1)

Given how simple the function we're using is, we could also create a lambda function that does the same thing.

In [None]:
combotypes2 = lambda row: row['Type 1'] if pd.isna(row['Type 2']) else f"{row['Type 1']}-{row['Type 2']}"

In [None]:
pokemon.head()
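As an alternative to a row-wise `apply`, the same rule can be sketched with vectorized column operations. This is only one possible variant, shown on a made-up frame `demo`: string concatenation yields `NaN` wherever `Type 2` is missing, and `Series.where` keeps `Type 1` in exactly those rows.

```python
import numpy as np
import pandas as pd

# Hypothetical toy data standing in for the real `pokemon` dataframe
demo = pd.DataFrame({
    'Type 1': ['Grass', 'Fire', 'Fire'],
    'Type 2': ['Poison', np.nan, 'Flying'],
})

# "<Type 1>-<Type 2>"; NaN where Type 2 is missing
combined = demo['Type 1'] + '-' + demo['Type 2']

# Keep Type 1 where Type 2 is missing, otherwise use the combined value
demo['Combo Type'] = demo['Type 1'].where(demo['Type 2'].isna(), combined)
print(demo)
```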

#### Identify the pokemon whose `A/D Ratio` is among the top 5.

**Solution:** We can simply sort the dataframe by the values of the `A/D Ratio` column and then look at the first 5 rows.

In [None]:
# your code here
pokemon.sort_values('A/D Ratio', ascending=False).head(5)
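pandas also offers `DataFrame.nlargest`, which expresses "top n by a column" directly. The sketch below compares it with the sort-then-head approach on made-up values without ties.

```python
import pandas as pd

# Hypothetical data with distinct ratios (no ties)
demo = pd.DataFrame({
    'Name': list('ABCDEFG'),
    'A/D Ratio': [1.0, 3.0, 0.5, 2.0, 4.0, 1.5, 2.5],
})

# Approach used above: sort descending, keep the first 5 rows
top5_sorted = demo.sort_values('A/D Ratio', ascending=False).head(5)

# Equivalent shortcut
top5_nlargest = demo.nlargest(5, 'A/D Ratio')

print(top5_nlargest)
```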

#### For the 5 pokemon printed above, aggregate `Combo Type` and use a list to store the unique values.

Your end product is a list containing the distinct `Combo Type` values of the 5 pokemon with the highest `A/D Ratio`.

**Solution:** We already found the 5 pokemon with the highest `A/D Ratio`; we just need to extract the unique `Combo Type` values. `unique()` removes the duplicates, and a list comprehension turns the result into a list.

In [None]:
# your code here
top_5_adratio = pokemon.sort_values('A/D Ratio', ascending=False).head(5)

combo_type_top_adratio = [x for x in top_5_adratio['Combo Type'].unique()]
combo_type_top_adratio
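For reference, here is a small sketch of how `unique()` behaves on a made-up Series: duplicates are dropped and the values keep their order of first appearance. Wrapping the result in `list()` is an equivalent, slightly shorter alternative to the list comprehension.

```python
import pandas as pd

# Hypothetical combo types with repeats
top = pd.Series(['Bug-Poison', 'Rock', 'Bug-Poison', 'Psychic', 'Rock'],
                name='Combo Type')

# unique() removes duplicates; list() converts the result to a plain list
unique_types = list(top.unique())
print(unique_types)
```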

#### For each of the `Combo Type` values obtained from the previous question, calculate the mean scores of all numeric fields across all pokemon.

Your output should look like below:

![Aggregate](../images/aggregated-mean.png)

**Solution:** We can group the rows by the `Combo Type` column, aggregate them using the `mean()` function, and then keep only the combo types obtained in the previous question.

In [None]:
# your code here
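# numeric_only=True skips text columns such as Name, which recent pandas versions refuse to average
pokemon.groupby('Combo Type').mean(numeric_only=True).loc[combo_type_top_adratio]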
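The group-then-select step can be checked on a tiny made-up frame. In the sketch below, `demo` and `wanted` are hypothetical stand-ins for `pokemon` and `combo_type_top_adratio`; `numeric_only=True` restricts the mean to numeric columns.

```python
import pandas as pd

# Hypothetical toy data
demo = pd.DataFrame({
    'Name': ['A', 'B', 'C', 'D'],
    'Combo Type': ['Bug-Poison', 'Bug-Poison', 'Fire', 'Water'],
    'Attack': [90, 150, 52, 48],
    'Defense': [40, 40, 43, 65],
})

# Combo types we want to keep (stand-in for the list built earlier)
wanted = ['Bug-Poison', 'Fire']

# Mean of every numeric column per combo type, then select the wanted groups
means = demo.groupby('Combo Type').mean(numeric_only=True).loc[wanted]
print(means)
```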