# Exploratory Data Analysis (EDA)

EDA is the process of examining and visualizing a dataset to understand its characteristics and uncover relationships between variables. It is a crucial step in the data analysis process. Working with data without performing EDA is like driving a car with your eyes closed. You might get to your destination, but you're more likely to crash along the way.

In this notebook, we will look at two datasets: an orange quality dataset and a Pokémon stats dataset. We will perform EDA to understand their structure and uncover relationships between variables. Pandas and Seaborn are probably the most commonly used libraries for EDA in Python, so we will use both of them here. Let's import them.

In [None]:
import pandas as pd
import seaborn as sns
import matplotlib.pyplot as plt  # Not necessary, but can be useful for customizing seaborn plots

import warnings # This library is used to ignore warnings, don't worry about it for now
warnings.filterwarnings('ignore')

# Introduction to Pandas

pandas.DataFrame is a 2-dimensional data structure whose columns can hold different types. You can think of it as an Excel spreadsheet. It is the most commonly used pandas object.
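
To make "columns of different types" concrete, here is a minimal sketch on a made-up table. The `toy` DataFrame and its values are hypothetical and only serve as an illustration.

```python
import pandas as pd

# Hypothetical toy data, used only to illustrate mixed column types
toy = pd.DataFrame({
    'name': ['A', 'B', 'C'],        # text
    'score': [1.5, 2.0, 3.25],      # floating-point numbers
    'passed': [True, False, True],  # booleans
})

print(toy.shape)   # (number of rows, number of columns)
print(toy.dtypes)  # one data type per column
```

Each column keeps its own data type, while all columns share the same row index.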

<br/><br/>
<center>
<img src="figures/characters.png" height="250">
</center>
<br/><br/>


Let's imagine we want to create a DataFrame that contains information about characters from the **critically-acclaimed 2019 game Disco Elysium**. We will use this example to learn how to create a DataFrame from scratch and perform some basic operations on it.

In [None]:
# First, initialize an empty DataFrame

df = pd.DataFrame()

# Now, let's create the columns as lists

first_names = ['Harry', 'Kim', 'Lawrence', 'Joyce', 'Jules', 'Goracy']
last_names =  ['Du Bois', 'Kitsuragi', 'Garte', 'Messier', 'Pidieu', 'Kubek']
ages = [44, 43, 28, 48, 68, 39]
occupations = ['Cop', 'Cop', 'Bartender', 'Landlady', 'Cop', 'Cook']
origins = ['Revachol', 'Revachol', 'Revachol', 'Revachol', 'Unknown', 'Graad']

# Let's add the columns to the DataFrame

df['first_name'] = first_names
df['last_name'] = last_names
df['age'] = ages
df['occupation'] = occupations
df['origin'] = origins

# Let's take a peek at the first 5 rows of the DataFrame. Another method, tail(), would show the last 5 rows.
df.head()

In [None]:
# We can also create a DataFrame from a dictionary

data = {'first_name': first_names, 
        'last_name': last_names, 
        'age': ages, 
        'occupation': occupations,
        'origin': origins}

df = pd.DataFrame(data)
df.head() 

# It's the same as before

In [None]:
# We can access the columns of the DataFrame using the column name. This will return a pandas Series, which is a 
# one-dimensional labeled array (similar to a Python dictionary). It is the building block of pandas DataFrames and 
# contains index values [0, 1, 2, ..., n] as the 'dictionary keys' and our data as the 'dictionary values'. 

# You won't need to work with pandas Series objects as often as with DataFrames, but it's good to know they exist.

# Let's see the first names of the characters.

first_names = df['first_name']
print(f"What we got is a pandas.Series object: {type(first_names)}")
print(first_names, '\n')

# pandas.Series can be converted to an ordinary Python list using the tolist() method
names_list = first_names.tolist()
print(f"Now it's a Python list: {type(names_list)}")
print(names_list)
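
The dictionary analogy becomes clearer when the index holds labels other than 0, 1, 2, and so on. The sketch below assumes a small hand-made Series (the name `ages_by_name` is illustrative) and shows label-based lookup.

```python
import pandas as pd

# A Series with custom string labels instead of the default 0..n-1 index
ages_by_name = pd.Series([44, 43, 28], index=['Harry', 'Kim', 'Lawrence'])

print(ages_by_name['Kim'])          # look up a value by its label, like a dictionary key
print(ages_by_name.index.tolist())  # the labels ('dictionary keys')
print(ages_by_name.to_dict())       # convert to an ordinary Python dictionary
```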

# Basic DataFrame operations

Let's review some basic operations that can be performed on a pandas DataFrame.  
You may wonder how to append a new row to an existing DataFrame. Unfortunately, it is not as straightforward as appending a new element to a Python list, as **DataFrames are not designed to grow in size dynamically**, but you can still do it.

<br/><br/>
<center>
<img src="figures/add_character.png" height="215">
</center>

In [None]:
# One approach is to create a new DataFrame containing only the new row and then concatenate (meaning - join by rows) 
# it with the original DataFrame.

# Let's create a dictionary with the information about a new character.
new_character = {'first_name': 'Klaasje', 'last_name': 'Amandou', 'age': 28, 'occupation': 'Corporate spy', 'origin': 'Oranje'}

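# Create a new DataFrame with the new character. Because all dictionary values are scalars, pandas requires an index.
# The label itself (0) does not matter here, as ignore_index=True below renumbers the rows.
df_new = pd.DataFrame(new_character, index=[0])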

# Concatenate the original DataFrame with the new DataFrame
df = pd.concat([df, df_new], ignore_index=True)
df.tail()
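
When many rows arrive one at a time, a common alternative is to collect them in a plain Python list and build the DataFrame once at the end, which avoids repeated concatenation. This is a sketch of that pattern rather than the only way to do it; the names `rows` and `people` are illustrative.

```python
import pandas as pd

# Rows collected one at a time, e.g. inside a loop
rows = []
rows.append({'first_name': 'Harry', 'last_name': 'Du Bois', 'age': 44})
rows.append({'first_name': 'Kim', 'last_name': 'Kitsuragi', 'age': 43})
rows.append({'first_name': 'Klaasje', 'last_name': 'Amandou', 'age': 28})

# Build the DataFrame in a single step from the list of dictionaries
people = pd.DataFrame(rows)
print(people)
```

Appending to a list is cheap, whereas every `pd.concat` call creates a new DataFrame.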

In [None]:
# Suppose we only want to see the last names and occupations of the characters. 
# We can do this by selecting the columns we want.

name_and_occupation = df[['last_name', 'occupation']]
name_and_occupation.head()

In [None]:
# We can also select rows based on a condition. For example, we can select only the characters who are cops.

cops_only = df[df['occupation'] == 'Cop']
cops_only.head()
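
Conditions can also be combined. The sketch below assumes a small stand-in DataFrame (`chars`) with the same column names as above. Note that pandas uses `&` and `|` rather than `and` and `or`, and that each condition needs its own parentheses.

```python
import pandas as pd

chars = pd.DataFrame({
    'first_name': ['Harry', 'Kim', 'Lawrence', 'Jules'],
    'age': [44, 43, 28, 68],
    'occupation': ['Cop', 'Cop', 'Bartender', 'Cop'],
})

# & means "and", | means "or"; wrap every condition in parentheses
cops_under_50 = chars[(chars['occupation'] == 'Cop') & (chars['age'] < 50)]
young_or_old = chars[(chars['age'] < 30) | (chars['age'] > 60)]

print(cops_under_50['first_name'].tolist())
print(young_or_old['first_name'].tolist())
```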

In [None]:
# To see summary statistics (count, mean, standard deviation, minimum, quartiles, maximum) of a numeric column, we can 
# use the describe() method

df['age'].describe()

In [None]:
# We can also sort the DataFrame by values in a column. Let's sort the characters by age.

df_sorted = df.sort_values(by='age')
df_sorted.head()

In [None]:
# As time passes, people usually get older. This does not really happen in the game, as the story spans only a few 
# days. Nevertheless, let's see what the characters' ages will be in 10 years and add this information to the DataFrame.

df['age_in_10_yrs'] = df['age'] + 10
df.head()

In [None]:
# If we wanted to perform a more complex operation on the 'age' column, we could use the apply() method.

# You can write a custom function that takes a value from a column, does something with it, and returns the result. 
# Then, you can apply this function to the column.

def describe_age(x):
    if x < 30:
        return 'Young'
    elif x < 60:
        return 'Middle-aged'
    else:
        return 'Elderly'

df['age_categorical'] = df['age'].apply(describe_age)
df.head()
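
For simple binning like this, `pd.cut` is one possible alternative to `apply()`. The sketch below assumes non-negative ages and uses the same thresholds as `describe_age`; the variable names are illustrative.

```python
import pandas as pd

ages = pd.Series([44, 43, 28, 48, 68, 39])

# right=False makes each bin include its left edge and exclude its right edge:
# [0, 30) -> 'Young', [30, 60) -> 'Middle-aged', [60, inf) -> 'Elderly'
age_groups = pd.cut(
    ages,
    bins=[0, 30, 60, float('inf')],
    labels=['Young', 'Middle-aged', 'Elderly'],
    right=False,
)

print(age_groups.tolist())
```

The result is a categorical Series, which also records the order of the categories.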

## Iterating over rows in a DataFrame (don't do it!)

When working with pandas DataFrames it is advised **not to use 'for' loops to iterate over rows**. In contrast to Python lists, pandas DataFrames are optimized for vectorized operations. This means that you can apply operations to entire columns at once, which is much faster than iterating over rows. If you find yourself iterating over rows in a DataFrame, you are probably doing something wrong. There is almost always a better, more efficient way to do it using pandas methods.
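
A minimal sketch of what a vectorized operation looks like, assuming a small DataFrame with the same column names as above. The `full_name` and `is_over_40` columns are illustrative additions.

```python
import pandas as pd

chars = pd.DataFrame({
    'first_name': ['Harry', 'Kim', 'Lawrence'],
    'last_name': ['Du Bois', 'Kitsuragi', 'Garte'],
    'age': [44, 43, 28],
})

# Each expression works on whole columns at once; no explicit loop is needed
chars['full_name'] = chars['first_name'] + ' ' + chars['last_name']
chars['is_over_40'] = chars['age'] > 40

print(chars)
```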

Nevertheless, be aware of the iterrows() method. Here is an example of how it could be used:

In [None]:
print('My favorite characters of Disco Elysium:')
for index, row in df.iterrows():
    first_name = row['first_name']
    last_name = row['last_name']
    job = row['occupation']
    print(f'{index}: {first_name} {last_name}, who is a {job}')

## Get familiar with pandas documentation!

In this notebook we explored only a fraction of pandas.DataFrame methods. You should get familiar with [pandas documentation](https://pandas.pydata.org/docs/reference/frame.html) and play around with the methods to get a better understanding of what you can do with pandas. Go ahead and try some of the methods on the DataFrame we created in this notebook.

<br />

***

# Introduction to Seaborn


Seaborn is a Python data visualization library built on top of matplotlib and closely integrated with pandas data structures. It provides a high-level interface for drawing attractive and informative statistical graphics.

For this example, we will use the orange quality dataset. This dataset contains information about the quality of some oranges, including the variety, weight, size, and sweetness of the oranges. Let's load the dataset and take a look at the first few rows.

In [None]:
df = pd.read_csv('data/orange_quality.csv')
df.head()

In [None]:
# Suppose we are only interested in oranges that are of variety 'Temple', 'Satsuma Mandarin' or 'Moro (Blood)'.
# We can filter the DataFrame to only include these varieties.

df_filtered = df[df['Variety'].isin(['Temple', 'Satsuma Mandarin', 'Moro (Blood)'])]
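
To see what `isin()` does without loading the CSV file, here is a sketch on a tiny made-up table. The `fruit` DataFrame, the 'Navel' rows, and all weights are hypothetical. Prefixing the condition with `~` inverts it.

```python
import pandas as pd

# Hypothetical stand-in for the orange dataset
fruit = pd.DataFrame({
    'Variety': ['Temple', 'Navel', 'Moro (Blood)', 'Navel'],
    'Weight (g)': [180, 210, 160, 200],
})

wanted = ['Temple', 'Moro (Blood)']

kept = fruit[fruit['Variety'].isin(wanted)]      # rows whose variety is in the list
dropped = fruit[~fruit['Variety'].isin(wanted)]  # ~ negates the condition

print(kept)
print(dropped)
```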

In [None]:
# Let's see how the time of harvest correlates with the weight of oranges. We will plot the relationship between 
# 'HarvestTime (days)' and 'Weight (g)' columns. 

# We can use a scatter plot for this. To color the points based on the 'Variety' column, pass hue='Variety' to the 
# scatterplot function.

sns.set_style('white') # Try 'darkgrid', 'whitegrid', 'dark', 'white', 'ticks'
sns.set_context('talk') # Try 'paper', 'poster', 'notebook' and see how it changes the looks

scatterplot = sns.scatterplot(data=df_filtered, x='HarvestTime (days)', y='Weight (g)', hue='Variety')

sns.move_legend(scatterplot, "upper left", bbox_to_anchor=(1, 1)) # Move the legend outside the plot.

In [None]:
# A boxplot is a standardized way of displaying the distribution of data based on a five-number summary: minimum, first 
# quartile, median, third quartile, and maximum. Let's plot the 'Brix (Sweetness)' values using a boxplot.

sns.set_style('white')
sns.set_context('talk')

boxplot = sns.boxplot(data=df_filtered, x='Brix (Sweetness)', y='Variety')
plt.ylabel('') # Remove the y-axis label (Variety) just to make the plot look cleaner
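
The five-number summary can also be computed directly with pandas. The sketch below uses made-up values rather than the orange dataset. Keep in mind that with seaborn's default settings the whiskers extend at most 1.5 times the interquartile range from the box, and points beyond that are drawn individually, so the whisker ends are not always the true minimum and maximum.

```python
import pandas as pd

# Hypothetical measurements, used only to illustrate the calculation
values = pd.Series([1, 2, 3, 4, 5, 6, 7, 8, 9])

# Quantiles 0, 0.25, 0.5, 0.75 and 1 are the minimum, first quartile,
# median, third quartile and maximum
five_number_summary = values.quantile([0, 0.25, 0.5, 0.75, 1])
print(five_number_summary)
```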

In [None]:
# Let's plot a distribution of the 'Size (cm)' values. We can use a kernel density estimate (KDE) plot for this. A KDE 
# plot is a non-parametric way to estimate the probability density function of a random variable. It's like a smoothed 
# version of a histogram.

sns.set_style('white')
sns.set_context('talk')

kdeplot = sns.kdeplot(data=df_filtered, x='Size (cm)', hue='Variety', fill=True)

# By passing hue='Variety' parameter to the kdeplot function, we color the plot based on the 'Variety' column. The 
# fill=True parameter fills the area under the curve with color.

sns.move_legend(kdeplot, "upper left", bbox_to_anchor=(1, 1)) # Move the legend outside the plot
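
To get a feeling for what a KDE computes, here is a hand-rolled sketch using NumPy: one Gaussian "bump" is placed on each data point and the bumps are averaged. The sample values and the bandwidth are made up for illustration; seaborn chooses the bandwidth automatically, so its curves will not match this sketch exactly.

```python
import numpy as np

samples = np.array([6.0, 6.5, 7.0, 7.2, 8.0])  # hypothetical sizes in cm
bandwidth = 0.5                                # width of each bump, chosen arbitrarily

# Points at which the density estimate is evaluated
grid = np.linspace(3.0, 11.0, 801)

# One Gaussian bump per sample, evaluated at every grid point
z = (grid[:, None] - samples[None, :]) / bandwidth
bumps = np.exp(-0.5 * z**2) / (bandwidth * np.sqrt(2 * np.pi))

# The density estimate is the average of the bumps
density = bumps.mean(axis=1)

# A probability density should integrate to approximately 1
area = density.sum() * (grid[1] - grid[0])
print(round(area, 3))
```

A larger bandwidth gives a smoother curve, and a smaller one follows the individual data points more closely.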

In [None]:
# A pairplot is a great way to explore and visualize relationships between variables in a dataset. It creates a matrix 
# of axes and shows the relationship for each pair of columns in a DataFrame. Let's create a pairplot for the orange 
# dataset, with a focus on 'Size (cm)', 'Weight (g)' and 'pH (Acidity)' columns. We will also color the points based on 
# the 'Variety' column.

# We select the columns we are interested in seeing
df_selected = df_filtered[['Size (cm)', 'Weight (g)', 'pH (Acidity)', 'Variety']]

pairplot = sns.pairplot(df_selected, hue="Variety")

# Exercise 1 (Pokemon stats)

Feel free to work with [pandas documentation](https://pandas.pydata.org/docs/reference/frame.html) and [seaborn documentation](https://seaborn.pydata.org/tutorial.html) to complete the following tasks.


In this exercise, we will work with a dataset containing the stats of different Pokémon. The dataset is stored in the CSV file `data/pokemon_data.csv`.

<br/><br/>
<center>
<img src="figures/pokemon_types.png" height="250">
</center>
<br/><br/>

1. Load the dataset into a DataFrame and take a look at the first few rows.  
2. How many different Pokémon types are there in the dataset? (Look for them in the 'type1' column.)
3. Prepare a pandas DataFrame with the number of Pokémon of each type (column 'type1'). The first column should contain the types and the second column should contain the number of Pokémon of that type. Hint: There is a pandas.DataFrame method that does exactly this in one line of code.
4. Plot a bar chart showing the number of Pokémon for the ten most common types. The x-axis should contain the Pokémon types and the y-axis should contain the number of Pokémon.
5. Prepare a boxplot showing the distribution of 'attack' values for the ten most common Pokémon types.
6. Prepare a pairplot showing the relationship between the 'attack', 'hp' and 'catch_rate' columns for the 'Rock', 'Dragon' and 'Bug' Pokémon types. Color the points based on the type. Apply a custom color palette to the plot, assigning an appropriate color to each type (see the image above). You will probably want to read the [pairplot documentation](https://seaborn.pydata.org/generated/seaborn.pairplot.html) and look for the `palette` parameter.
