# Pandas

Pandas is a powerful open-source Python library that provides flexible and efficient tools for working with structured data. We're going to use it to clean and transform data to get the outputs we want.

# Install

Pandas should already be installed in your environment. If it's not, run the cell below to install it, which might take a few minutes.

In [None]:
!pip install pandas

# Import

To use pandas and all of its methods, we need to import it into our environment. Run this cell first to check you have it installed; if it fails, run the install cell above.

In [None]:
import pandas as pd

# Check pandas version
print(f"Pandas version: {pd.__version__}")

# Create a dataframe

First, we will need a dataframe to work with. The easiest thing to do is create our own!

Here's how to manually create a pandas dataframe:

In [None]:
df = pd.DataFrame({
    "numbers": [1, 2, 3],
    "letters": ["A", "B", "C"],
    "fruits": ["apples", "bananas", "cherries"]
})

Notebooks can display the dataframe in a rich text format: all we need to do is run a cell with the variable name on its last line.

In [None]:
df

Want to add a column? Easy.

In [None]:
df["vegetables"] = ["artichokes", "Brussels sprouts", "cabbages"]
df
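
As a small sketch of the other ways a column can be added, the cell below builds a separate dataframe (`example_df`, with illustrative column names that are not part of the data above). It shows a column computed from another column, a single value repeated for every row, and what happens when a list has the wrong length.

```python
import pandas as pd

example_df = pd.DataFrame({"numbers": [1, 2, 3]})

# A new column can be computed from an existing one
example_df["doubled"] = example_df["numbers"] * 2

# A single value is repeated for every row
example_df["source"] = "manual"

# A list must have one item per row, otherwise pandas raises a ValueError
try:
    example_df["too_short"] = [10, 20]
except ValueError as error:
    print(f"Could not add column: {error}")

print(example_df)
```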

Remove a column? Also easy.

In [None]:
df = df.drop("numbers", axis=1) # axis is 0 (rows) by default
df

If we had many columns we could also select the columns we wanted to keep:

In [None]:
df = df[["fruits", "vegetables"]] # Using the double square bracket
df
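
Dropping the columns we don't want and selecting the columns we do want are two routes to the same result. The sketch below checks this on a fresh copy of the first dataframe; `example_df`, `dropped` and `selected` are illustrative names.

```python
import pandas as pd

example_df = pd.DataFrame({
    "numbers": [1, 2, 3],
    "letters": ["A", "B", "C"],
    "fruits": ["apples", "bananas", "cherries"],
})

# Drop one column by name; columns=[...] is equivalent to passing axis=1
dropped = example_df.drop(columns=["numbers"])

# Keep the columns we want by passing a list of names
selected = example_df[["letters", "fruits"]]

# Both approaches return new dataframes with the same contents
print(dropped.equals(selected))
```

Neither call changes `example_df` itself, which is why the cells above assign the result back to `df`.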

Let's import some data from elsewhere using the `pd.read_csv()` function.

What do you notice about the data?

In [None]:
file_path = "pandas_data.csv"
data_df = pd.read_csv(file_path)
data_df
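
If you don't have the CSV file to hand, the sketch below reads the same kind of data from a string instead. The rows and the `Name` column are made up for illustration, and `io.StringIO` simply makes the text behave like a file.

```python
import io

import pandas as pd

# Hypothetical CSV text standing in for a file on disk
csv_text = """Name,Age,Mass
Alex,34,70.5
Sam,,82.0
Jo,19,
"""

sample_df = pd.read_csv(io.StringIO(csv_text))

# Empty fields are read as NaN, which pandas treats as a missing value
print(sample_df)

# Count the missing values in each column
print(sample_df.isna().sum())
```

Notice that `Age` is read as a decimal (float) column here, because by default a whole-number column cannot hold a missing value.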

Hopefully you noticed the missing values, which pandas displays as `NaN`! We will call these nulls.

Let's find out more about the dataframe using these methods:

In [None]:
data_df.info() # Shows each column's name, its type and how many non-null values it has
data_df.describe() # Shows summary statistics (count, mean, min, max and so on) for the numeric columns

Null values can cause a lot of trouble in our code, so we have to be careful how to handle them.

For our purposes, let's bin (drop) them.

In [None]:
data_df = data_df.dropna() # This will drop any rows with a Null value
data_df.info() # comparing to above, you can see they have been dropped.
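
Dropping every row with a null is not the only option. As a hedged sketch on made-up data (`people_df` and its values are illustrative), the cell below shows two alternatives: dropping rows only when a particular column is missing, and filling gaps with a value instead.

```python
import pandas as pd

# Hypothetical data with gaps in two columns (None becomes NaN)
people_df = pd.DataFrame({
    "Age": [34, None, 19, 52],
    "Mass": [70.5, 82.0, None, 64.0],
})

# Drop a row only when "Age" is missing; gaps in other columns are kept
age_known_df = people_df.dropna(subset=["Age"])

# Replace missing "Mass" values with the column median instead of dropping
filled_df = people_df.fillna({"Mass": people_df["Mass"].median()})

print(len(people_df.dropna()))  # rows with no missing values at all
print(len(age_known_df))        # rows where "Age" is present
print(filled_df)
```

Which approach is right depends on the analysis. Filling values changes the data, so it is worth noting wherever you do it.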

# Duplicates

When cleaning data, we should remember that some rows may be duplicated, and we won't know until we check.

We can look for duplicates, and then drop them.

In [None]:
data_df.duplicated() # Returns True or False for each row: True if the row repeats an earlier row

In [None]:
data_df = data_df.drop_duplicates() # This will drop any duplicate rows
data_df.info()
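
By default, a row only counts as a duplicate if every column matches an earlier row. The sketch below, on made-up data (`visits_df` is an illustrative name), shows how the `keep` and `subset` parameters change that.

```python
import pandas as pd

# Hypothetical data: the first two rows are identical, and the third
# shares a "Name" with them but has a different "Age"
visits_df = pd.DataFrame({
    "Name": ["Alex", "Alex", "Alex", "Sam"],
    "Age": [34, 34, 35, 52],
})

# By default only the repeat is flagged, not the first occurrence
print(visits_df.duplicated().tolist())

# keep=False flags every member of a duplicated group
print(visits_df.duplicated(keep=False).tolist())

# subset= compares only the named columns
deduped_by_name = visits_df.drop_duplicates(subset=["Name"])
print(deduped_by_name)
```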

# Sort and filter dataframes

We can sort dataframes by any column and filter them on any condition:

In [None]:
data_df = data_df.sort_values("Age", ascending=True) # ascending is True by default
data_df
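
Sorting also works on several columns at once. This is a minimal sketch on made-up data (`people_df` is illustrative) that sorts by one column in descending order and breaks ties with a second column.

```python
import pandas as pd

# Hypothetical data using the same column names as above
people_df = pd.DataFrame({
    "Age": [34, 19, 52, 19],
    "Mass": [70.5, 58.0, 64.0, 61.0],
})

# Sort by Age descending, breaking ties by Mass ascending
sorted_df = people_df.sort_values(["Age", "Mass"], ascending=[False, True])

# Sorting keeps the original index labels; reset them for a clean 0..n-1 index
sorted_df = sorted_df.reset_index(drop=True)
print(sorted_df)
```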

In [None]:
data_df = data_df[data_df["Age"] >= 18] # Filtering for rows where the Age column is greater than or equal to 18
data_df

We can filter on several conditions at once and sort at the end. Watch the brackets: each condition needs its own pair.

In [None]:
data_df = data_df[(data_df["Age"] >= 18) & (data_df["Mass"] >= 60) & (data_df["Sex"].isin([1, 0]))].sort_values("Mass")
data_df
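
A long filter can be easier to read if each condition gets a name first. The sketch below does this on made-up data; `people_df` and the mask names are illustrative.

```python
import pandas as pd

# Hypothetical data using the same column names as above
people_df = pd.DataFrame({
    "Age": [16, 34, 19, 52],
    "Mass": [55.0, 70.5, 58.0, 64.0],
    "Sex": [0, 1, 1, 0],
})

# Name each condition; every mask holds one True/False value per row
is_adult = people_df["Age"] >= 18
is_heavy_enough = people_df["Mass"] >= 60
has_known_sex = people_df["Sex"].isin([0, 1])

# Combine masks with & (and), | (or) and ~ (not), then sort the result
filtered_df = people_df[is_adult & is_heavy_enough & has_known_sex].sort_values("Mass")
print(filtered_df)
```

In the one-line version, the brackets around each condition are needed because Python evaluates `&` before comparisons such as `>=`.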

# Grouping

Below, we are grouping on the "Sex" column, to find out how many instances there are of each value.

We then reset the index and rename the count column.

In [None]:
grouped_data_df = data_df.groupby(["Sex"]).size().reset_index(name="Count")
grouped_data_df
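
Counting is only one thing we can do with groups. As a sketch on made-up data (`people_df` and the `MeanAge` column name are illustrative), the cell below repeats the count and also works out the mean age in each group.

```python
import pandas as pd

# Hypothetical data using the same column names as above
people_df = pd.DataFrame({
    "Sex": [0, 1, 1, 0, 1],
    "Age": [20, 30, 40, 50, 20],
})

# size() counts the rows in each group; reset_index turns the result into a dataframe
counts_df = people_df.groupby("Sex").size().reset_index(name="Count")

# Selecting a column after groupby lets us summarise it per group, here with the mean
mean_age_df = people_df.groupby("Sex")["Age"].mean().reset_index(name="MeanAge")

print(counts_df)
print(mean_age_df)
```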

# Joining

Sometimes we might need to join dataframes together, for example to get the exact names of trusts based on their codes.

Here is another dataframe that maps the codes 0 and 1 to Male and Female. We'll join it onto the `data_df` we have been working with.

Joining dataframes together is simple: we just need to pick which columns to merge on.

In [None]:
sex_indicator_df = pd.DataFrame({
    "code": [0, 1],
    "sex": ["Male", "Female"]
})

sex_indicator_df

In [None]:
joined_df = pd.merge(data_df, sex_indicator_df, left_on='Sex', right_on='code', how='inner')
joined_df

And now we can drop any columns we don't want anymore!

In [None]:
joined_df = joined_df.drop(['Sex', 'code'], axis=1)
joined_df
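
The `how` parameter decides what happens to rows that have no match. The sketch below uses made-up data (`people_df`, `lookup_df` and the code 2 are illustrative) to compare an inner join with a left join.

```python
import pandas as pd

# Hypothetical data: code 2 has no match in the lookup table
people_df = pd.DataFrame({"Name": ["Alex", "Sam", "Jo"], "Sex": [0, 1, 2]})
lookup_df = pd.DataFrame({"code": [0, 1], "sex": ["Male", "Female"]})

# An inner join keeps only rows with a match in both dataframes
inner_df = pd.merge(people_df, lookup_df, left_on="Sex", right_on="code", how="inner")

# A left join keeps every row from the left dataframe;
# indicator=True adds a "_merge" column showing where each row came from
left_df = pd.merge(people_df, lookup_df, left_on="Sex", right_on="code",
                   how="left", indicator=True)

print(len(inner_df))
print(left_df[["Name", "sex", "_merge"]])
```

An inner join silently drops unmatched rows, so comparing row counts before and after a join is a quick way to check that nothing went missing.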

# Graphs

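Pandas dataframes allow us to make some quick and dirty graphs! Pandas uses a library called `matplotlib` behind the scenes to draw them, so it needs to be installed, but we don't have to import it ourselves for a basic plot.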

In [None]:
joined_df.plot(x='Mass', y='Age', kind='bar', title='Age vs Mass')