# Lesson 10
Previously we discussed some of the other built-in data structures that Python has to offer. We went over tuples, sets, and, most importantly, dictionaries. In this lesson we will begin working with the Python data module known as pandas.

Pandas is a data structure module that allows us to work with structures called DataFrames. These DataFrames hold data similarly to a 2-D array. They structure our data into different rows and columns based on the information provided.

Before taking a deep dive into pandas, let's go over how the data in a DataFrame can be represented as a dictionary and as a 2-D list.

In [None]:
df_dict = {
    "Name": ["John Smith", "Jane Doe", "Alfred Pennyworth", "Bruce Wayne"],
    "Age": [20, 30, 65, 50],
    "Occupation": ["Businessman", "Civil Engineer", "Butler", "Philanthropist"]
}

df_arr = [
    ["John Smith", "Jane Doe", "Alfred Pennyworth", "Bruce Wayne"],
    [20, 30, 65, 50],
    ["Businessman", "Civil Engineer", "Butler", "Philanthropist"]
]

The first example is a dictionary in which each key describes what its values represent: name, age, and occupation. The second example holds the same data as a list of lists, with one inner list each for the names, ages, and occupations.

We can access both using indexing like we always have, where the index of a dictionary is the key, and the index of a list is an integer from 0 to one less than the length of the list.

Feel free to access other elements in the previous structures using the code below.

In [None]:
names1 = df_dict["Name"]
names2 = df_arr[0]
print(names1, '\n', names2)
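
The same indexing can also pull out one person's full record instead of a whole column. The following is a minimal sketch that reuses the two structures above; the variable `i` and the names `record_from_dict` and `record_from_list` are introduced here only for illustration.

```python
df_dict = {
    "Name": ["John Smith", "Jane Doe", "Alfred Pennyworth", "Bruce Wayne"],
    "Age": [20, 30, 65, 50],
    "Occupation": ["Businessman", "Civil Engineer", "Butler", "Philanthropist"]
}

df_arr = [
    ["John Smith", "Jane Doe", "Alfred Pennyworth", "Bruce Wayne"],
    [20, 30, 65, 50],
    ["Businessman", "Civil Engineer", "Butler", "Philanthropist"]
]

i = 2  # position of the person we want (indices start at 0)

# Take the i-th value from every column of the dictionary
record_from_dict = {key: values[i] for key, values in df_dict.items()}

# Take the i-th value from every inner list
record_from_list = [column[i] for column in df_arr]

print(record_from_dict)
print(record_from_list)
```

Notice that the dictionary version keeps the column names attached to each value, while the list version relies on us remembering which position holds which kind of data.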

## Creating our First DataFrame
To get started with pandas, the first thing we have to do is install the pandas module using the `pip` package manager. Run the following command in your terminal:

`pip install pandas`

Once pandas is installed, we then have to import it. By convention, the module is imported under the alias `pd`. We will create a DataFrame below using the dictionary we created earlier.

In [None]:
import pandas as pd

df_dict = {
    "Name": ["John Smith", "Jane Doe", "Alfred Pennyworth", "Bruce Wayne"],
    "Age": [20, 30, 65, 50],
    "Occupation": ["Businessman", "Civil Engineer", "Butler", "Philanthropist"]
}


df = pd.DataFrame(df_dict)
print(df)

## Creating our First Data Visualization

As you can see from the output of the code above, we have a DataFrame with data inside it. The data is separated into rows and columns. The rows are numbered starting from 0, just like list indices, and the columns are named after the keys of our dictionary.
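
To check the row numbers and column names directly, we can look at the DataFrame's `index` and `columns` attributes. The following is a small sketch that rebuilds the same DataFrame so that it can be run on its own.

```python
import pandas as pd

df = pd.DataFrame({
    "Name": ["John Smith", "Jane Doe", "Alfred Pennyworth", "Bruce Wayne"],
    "Age": [20, 30, 65, 50],
    "Occupation": ["Businessman", "Civil Engineer", "Butler", "Philanthropist"]
})

# Column labels come from the dictionary keys
print(list(df.columns))

# Row labels are assigned automatically, starting at 0
print(list(df.index))

# iloc selects a row by its position; [] selects a column by its name
first_row = df.iloc[0]
ages = df["Age"]
print(first_row["Name"], list(ages))
```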

Like other data tools, pandas can also load data from outside sources. There is a student.csv file in our directory, and we will use this data going forward. Let's begin working with it now.
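
To get a feel for what `read_csv()` produces before opening the real file, here is a minimal sketch that parses a small piece of CSV text held in memory. The rows are made up for illustration and are not the contents of student.csv; only the column names `id`, `class`, and `mark` follow the ones used in this lesson, and `csv_text` is a name chosen just for this example.

```python
import io

import pandas as pd

# Hypothetical CSV content: the first line holds the column names
csv_text = """id,class,mark
1,Four,75
2,Three,85
3,Fifth,69
4,Five,55
"""

# read_csv accepts a file path or a file-like object such as StringIO
df = pd.read_csv(io.StringIO(csv_text))

# head() shows the first few rows, a quick way to inspect new data
print(df.head())
print(list(df.columns))
```

Each line of the text becomes one row, and the header line becomes the column names, which is the same thing that happens when we pass a file name instead.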

We will also be creating a bar graph that will represent our data. We do this by using the matplotlib.pyplot module. Install matplotlib using the following command:

`pip install matplotlib`

In the import statement, you will notice that we import `pyplot` as a submodule of the matplotlib library, by convention under the alias `plt`.

In [None]:
import pandas as pd
import matplotlib.pyplot as plt

file = "student.csv"
df = pd.read_csv(file)

"""
Let's clean this up
1)  We can get rid of the id column, since our indices already have the number associated with the student.
    To do this, we can use the drop() function
"""

df = df.drop(columns='id')

"""
As you can see in our output we no longer have a column for id numbers.
To isolate different columns, we just use the same [] brackets like we do for dictionaries and
specify the name of the column.
"""

classes = df['class']

"""
We should probably clean this up a little as these classes are a little wonky looking.
Let's normalize the names of our classes.

Since we know the data, we can rename the one outlier, the "Fifth" class, to "Five" so that it fits our naming scheme.

We use the replace() function to change one class value to another so that it matches everything else in our dataframe.
"""

df['class'] = df['class'].replace('Fifth', 'Five')

"""
We have a normalized naming scheme for our classes now. Let's now create a list of averages for the marks in each class within our dataframe. First, let's create our x values, which will go on our x axis of our graph.
"""

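xes: list[str] = sorted(df['class'].dropna().unique())

"""
Here we created a sorted list of all the classes without any repeats; these will be our x values when we plot our bar graph. Sorting matters because groupby() returns its groups in sorted order, so each label lines up with its average.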

Let's now get the averages for each class in our list.

We do this by grouping each class together in our dataframe, and then average out the marks for each class using the mean() function.

We can now create a bar graph using the matplotlib.pyplot library using the bar() function.
"""

yes: list[float] = list(df.groupby('class')['mark'].mean())

# Here all we do is declare title, x-label, and y-label to pretty up the bar graph
plt.title("Class Averages")
plt.xlabel("Class")
plt.ylabel("Average Marks")
plt.bar(xes, yes)

Below will be all the code we used put together into a more compact and readable format.

In [None]:
import pandas as pd
import matplotlib.pyplot as plt

# Read the csv file which will contain our dataframe
df: pd.DataFrame = pd.read_csv('student.csv')

# Drop unnecessary columns (drop() returns a new dataframe, so we reassign it)
df = df.drop(columns='id')

# Normalize any values to fit with our dataframe naming scheme
df['class'] = df['class'].replace('Fifth', 'Five')

# Create our list of x-values for our graph which will be the names of the classes,
# sorted so they match the order of the groupby() results below
xes = sorted(df['class'].dropna().unique())

# Create our list of y-values for our graph which will be the average marks for each class
yes = list(df.groupby('class')['mark'].mean())

# Format our graph and output it to the screen
plt.title("Class Averages")
plt.xlabel("Class")
plt.ylabel("Average Marks")
plt.bar(xes, yes)
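
One way to keep the labels and the averages in step is to take both from the same `groupby()` result. The following is a minimal sketch under that assumption; the rows are made up for illustration, only the column names `class` and `mark` come from the lesson, and `averages` is a name introduced here. The plotting calls are left out so the example focuses on the data.

```python
import pandas as pd

# Hypothetical rows standing in for student.csv
df = pd.DataFrame({
    "class": ["Four", "Three", "Fifth", "Five", "Four", "Three"],
    "mark": [80, 60, 90, 70, 60, 80],
})

# Normalize the outlier class name, as in the lesson
df["class"] = df["class"].replace("Fifth", "Five")

# The result is a Series: its index holds the class names,
# its values hold the mean mark for each class
averages = df.groupby("class")["mark"].mean()

xes = list(averages.index)  # class names, in the same order as the means
yes = list(averages)        # mean mark for each class

print(xes)
print(yes)
```

Because both lists are read from `averages`, the first label always belongs to the first bar, whatever order the classes appear in the file.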

This concludes our introduction to pandas. We covered quite a lot:
1) DataFrames
2) Importing data
3) Cleaning data
4) Creating Data Visualizations

There is much more to this module and the other data libraries.

## Assignment
Create a new Python file in your project directory and complete the template below by writing a function called `clean_data` that follows these guidelines:
1) Group the dataframe by the column name "Series_title_1" by using the `groupby()` function
2) Take 10 random groups from the `GroupBy` object created from the previous step by using `random.sample()`
    * You will have to convert the `GroupBy` object into a list in order to take a sample
3) Create a new dataframe that will contain the information from the random groups you took in the previous step
4) Repeat step 1 and create another `GroupBy` object on the dataframe created in the previous step
5) Your x-values will be names of the groups in the `GroupBy` object created in the previous step
6) Your y-values will be the mean values of each group's "Data_value" column
7) Return your x and y values.
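
If the idea of sampling from a `GroupBy` object is unfamiliar, the mechanics can be sketched on a toy dataframe. Everything below is hypothetical: the column names `group` and `value` and the data are made up, and the sample size is 2 rather than 10. It is meant as a hint about how the pieces fit together, not as the assignment solution.

```python
import random

import pandas as pd

# Hypothetical toy data with four groups
df = pd.DataFrame({
    "group": ["a", "a", "b", "b", "c", "c", "d", "d"],
    "value": [1.0, 3.0, 2.0, 4.0, 5.0, 7.0, 6.0, 8.0],
})

random.seed(0)  # fixed seed only so repeated runs pick the same groups

# Converting a GroupBy object to a list gives (group_name, sub_dataframe) tuples
groups = list(df.groupby("group"))

# Pick 2 of those tuples at random, without repeats
sampled = random.sample(groups, 2)

# Stack the sampled sub-dataframes back into a single dataframe
sampled_df = pd.concat([sub_df for _, sub_df in sampled])

# Group again and average, as in the lesson
means = sampled_df.groupby("group")["value"].mean()
print(list(means.index), list(means))
```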

I will create the display_data function for you along with reading the csv file that you will be using for the assignment. I have also provided some type annotations to help you understand what you need to return for this program to function correctly.

In [None]:
import pandas as pd
import matplotlib.pyplot as plt
import random


def clean_data(df: pd.DataFrame) -> tuple[list[str], list[float]]:
    # TODO: clean the dataframe provided
    pass

def display_data(xes: list[str], yes: list[float]) -> None:
    plt.title("Service Means")
    plt.xticks(range(len(xes)), xes, rotation=45)
    plt.ylabel("Mean Values")
    plt.bar(xes, yes)
    plt.show()

if __name__ == "__main__":
    file: str = "cpi.csv"
    df: pd.DataFrame = pd.read_csv(file)
    xes, yes = clean_data(df)
    display_data(xes, yes)