<a href="https://colab.research.google.com/github/nosher150/Module7-notebook/blob/main/Introducing_Dataframes.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Introduction to Pandas

Python is a jack-of-all-trades language that lets you build functions to do almost anything you want. But what benefits does it have for our work as data analysts?<br> <br>It turns out, quite a lot!<br><br> You can perform any of the data cleaning, analysis or visualisation techniques you have learned already in a way that is much more efficient and fine-tuned to what you need. This is where pandas comes in. <br><br> Pandas is an open source library that is mainly used for data analysis. It allows us to import and view data, as well as perform a wide range of exploratory or analytical techniques. On top of this, pandas will be the base building block of everything we do in subsequent modules. <br><br> In this workbook we will show you how to import data with pandas and how to perform some basic data cleaning.

In [None]:
# Importing pandas is the same as any other library. 
# Notice that the alias pd is used. This is an industry standard and something you should get in the habit of doing

import pandas as pd

Before we go further we need to define a couple of terms within pandas<br><br>

## Series

Technically, a series is a one dimensional array holding data of any type. Simply, if you imagine a pandas object as a table, then a series is a column. 

In [None]:
a=['Jess','Garfield','Snowball II']

my_series=pd.Series(a)

print(my_series)

Like a list, a series is 0-indexed by default, so to retrieve any element of a series we just need to call the relevant index.

In [None]:
print(my_series[1])

Unlike a list, we can actually name the indices within a series

In [None]:
my_series=pd.Series(a,index=['Postman Pat','Jon Arbuckle','Homer Simpson'])
print(my_series)

In [None]:
print(my_series['Postman Pat'])

## Dataframe

If a series is a column, then a dataframe is the table itself. These are two-dimensional arrays, which consist of at least one series. For many people it is easier to think of dataframes as tables with columns and rows. <br><br>

There are several ways we can build a dataframe manually, one of the simplest is to use a dictionary.

In [None]:
data={'Owner':['Postman Pat','Jon Arbuckle','Homer Simpson'],'Pet':['Jess','Garfield','Snowball II']}

df=pd.DataFrame(data)

df

Notice that the keys from the dictionary are now the column names, we will come back to this later to show you how you can change column names. <br><br>

Most of the time though, we aren't going to be building dataframes manually. Normally we will want to load a dataset from a file. Pandas has built-in functions for several types of file, which you can read about in their <a href='https://pandas.pydata.org/docs/reference/api/pandas.read_csv.html'>documentation</a>. The one we will be using in this course is <code>read_csv</code>, which reads in a CSV file. 

In [None]:
# The dataset we will be using in this workbook contains records of over 80000 UFO sightings
df=pd.read_csv('Data/ufo_sighting_data.csv')

Note that pandas gives each series (column) a single data type. If the values in a column aren't consistent (as in this case), pandas will fall back to a type that can hold all of them. Here it sets the conflicting column to object, the general-purpose type pandas normally uses for text.
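As a small illustration of how a column ends up as object, the sketch below builds two toy series by hand (the values are made up and are not taken from the UFO dataset) and prints the data type pandas chooses for each.

```python
import pandas as pd

# A column of consistent numbers gets a numeric dtype
numeric = pd.Series([1, 2, 3.5])
print(numeric.dtype)   # float64

# Mixing numbers and text forces the general-purpose 'object' dtype
mixed = pd.Series([1, 'two', 3.0])
print(mixed.dtype)     # object
```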

If you would like to read about importing Excel files, you can read the documentation <a href='https://pandas.pydata.org/docs/reference/api/pandas.read_excel.html'>here</a>.

# Inspecting the Data

Now we know how to load our data into pandas, it is time to look at the various functions pandas has for inspecting the data. <br><br>

As before, whenever you receive data you should inspect it to understand what it contains, what you can do with it and how it might need to be cleaned. In this section we will show you some basic pandas functions for inspecting our data.

In [None]:
# .head() shows the first 5 rows of a dataframe by default; pass a number to see more or fewer, e.g. df.head(20)

df.head()

In [None]:
# .tail() does the same thing, but for the last 5 rows
# Both functions are useful for getting a quick look at the data to see what is contained

df.tail()

In [None]:
# To understand how big our dataframe is, we can use .shape to return the number of rows and columns

df.shape

In [None]:
# For a quick view of what data types are present, use .dtypes

df.dtypes

In [None]:
# For a quick summary of what is happening in each series (column) you can use .describe()
# By default this will only display series where the datatype is numeric (float or integer)

df.describe()

In [None]:
# To look at series with non-numeric datatypes you can set the parameter include to 'all'
# Note that for non-numeric datatypes it uses new summaries (unique, top, freq) and does not include those for numerics

df.describe(include='all')

In [None]:
# For a more overall look at your different columns, use .info()
# This shows for each column what the datatype is and how many non-nulls there are

df.info()

In [None]:
# To view the column names, you use .columns

df.columns

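You may have noticed that some of these have parentheses at the end, while others don't. Those with parentheses (e.g. `.head()`, `.describe()`) are methods: they perform some sort of calculation or procedure and can take parameters. Those without (e.g. `.shape`, `.dtypes`, `.columns`) are attributes: they simply return information already stored on the dataframe.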
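To make the difference concrete, here is a minimal sketch using the small owner/pet table from earlier rather than the UFO data. The variable name `pets` is only for illustration.

```python
import pandas as pd

pets = pd.DataFrame({'Owner': ['Postman Pat', 'Jon Arbuckle', 'Homer Simpson'],
                     'Pet': ['Jess', 'Garfield', 'Snowball II']})

# Attribute: no parentheses, returns information stored on the dataframe
print(pets.shape)            # (3, 2)

# Method: parentheses, performs a procedure and can take parameters
print(pets.head(2))

# Leaving the parentheses off a method returns the method itself, not its result
print(callable(pets.head))   # True
```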



# Viewing Specific Data

So far we have looked at ways of inspecting the whole dataframe, but what if we wanted to look at something more specific?

In [None]:
# To look at a specific column, you can use the syntax df.column_name
# Note this returns a series

df.city

In [None]:
# Alternatively, you can also use the syntax df[column_name]
# This again will return a series, this is the only viable method for column names with whitespace, e.g. df['column name']

df['city']

In [None]:
# If you want to return a column in a new dataframe, then you need to double wrap the square brackets

df[['city']]

In [None]:
# This method therefore allows you to return more than one column as a dataframe is two-dimensional

df[['city','country']]

In [None]:
# A very powerful way of returning specific parts of a dataframe is to use .loc
# Think of it like using coordinates (row and column names) to specifically retrieve what you would like
# It can be used to retrieve a single value, a column, a row or a subset of the dataframe
# Note that .loc slices include both endpoints, so 0:10 returns 11 rows

df.loc[0:10,'city':'country']

In [None]:
# Another method is to use .iloc, which uses integer positions instead of names to retrieve specific parts of a dataframe
# Unlike .loc, .iloc slices exclude the end position, so 0:10 returns 10 rows

df.iloc[0:10,1:4]
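One difference between the two is easy to miss: `.loc` slices include the end label, while `.iloc` slices stop before the end position. The sketch below shows this on a small hand-made table (the rows and the `species` column are invented for illustration).

```python
import pandas as pd

pets = pd.DataFrame({
    'owner': ['Postman Pat', 'Jon Arbuckle', 'Homer Simpson', 'Homer Simpson'],
    'pet': ['Jess', 'Garfield', 'Snowball II', "Santa's Little Helper"],
    'species': ['cat', 'cat', 'cat', 'dog'],
})

# .loc uses labels and includes both ends: rows 0, 1 and 2, columns owner to pet
by_label = pets.loc[0:2, 'owner':'pet']
print(by_label.shape)      # (3, 2)

# .iloc uses positions and excludes the end: rows 0 and 1, columns 0 and 1
by_position = pets.iloc[0:2, 0:2]
print(by_position.shape)   # (2, 2)
```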

In [None]:
# To look at the frequency each item appears in a column we can use .value_counts()

df.country.value_counts()

In [None]:
# Finally, if you would like to see what the unique values are in a column you can use .unique()
# Mostly useful for categorical data

df.country.unique()

In [None]:
# Practice: What city reported the most sightings?

#A:

Click here to view solution<br><br>

<p style=color:white> df.city.value_counts()</p>

# Filtering

Now we know how to look at data within our dataframes, let's see how we can filter them to look at specific strata within the data

In [None]:
# The syntax for filtering data starts the exact same way as selecting a series, except now we add a condition
# This returns for each value a True/False whether that condition has been met

df.country=='us'

In [None]:
# To then return a dataframe that returns the rows where this condition is True we use the following syntax

df[df.country=='us'].head()

You can use any type of condition that you have learned before, including >, <, >=, <=, !=

In [None]:
# If you would like to add more than one condition you need to wrap each in parentheses
# If you want both conditions to apply you use & between the conditions

df[(df.country=='us')&(df.UFO_shape=='circle')].head()

In [None]:
# If you want either condition to be applied you place | between conditions

df[(df.country=='us')|(df.country=='ca')].head()
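When you mix `&` and `|` in one filter, Python evaluates `&` first, so it is worth adding an extra pair of parentheses around the part you want grouped. The sketch below uses a tiny made-up table (not the real dataset) to show how the result changes.

```python
import pandas as pd

sightings = pd.DataFrame({
    'country': ['us', 'gb', 'us', 'ca'],
    'UFO_shape': ['light', 'light', 'circle', 'sphere'],
})

not_us = sightings.country != 'us'
sphere = sightings.UFO_shape == 'sphere'
light = sightings.UFO_shape == 'light'

# Without extra parentheses this reads as (not_us & sphere) | light,
# so the 'us' row with shape 'light' is included as well
ungrouped = sightings[not_us & sphere | light]
print(len(ungrouped))   # 3

# Grouping the | part keeps only non-us rows that are sphere or light
grouped = sightings[not_us & (sphere | light)]
print(len(grouped))     # 2
```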

# Practice

In [None]:
#1. Filter the dataframe to show UFO_shapes that are circle

#A:

Click here to view solution<br><br>

<p style=color:white> df[df.UFO_shape=='circle'] </p>

In [None]:
#2. Filter the dataframe to only show UFO sightings from Texas (tx)

#A:

Click here to view solution<br><br>

<p style=color:white> df[df['state/province']=='tx'] </p>

In [None]:
#3. Filter the data to show sightings that are either cylindrical or circular

#A:

Click here to view solution<br><br>

<p style=color:white> df[(df.UFO_shape=='circle')|(df.UFO_shape=='cylinder')] </p>

In [None]:
#4. Filter the data to show sightings from the UK (gb) and last 300 seconds

#A: 

Click here to view solution<br><br>

<p style=color:white> df[(df.country=='gb')&(df.length_of_encounter_seconds==300)] </p>

In [None]:
#5. (Stretch) Filter the data to show sightings that are from outside the United States and are either spherical or light

#A: 

Click here to view solution<br><br>

<p style=color:white> df[(df.country!='us')&((df.UFO_shape=='sphere')|(df.UFO_shape=='light'))] </p>

In [None]:
#6. (Stretch) What is the most common UFO_shape in the United States?

#A:

Click here to view solution<br><br>

<p style=color:white> df[df.country=='us'].UFO_shape.value_counts() </p>

# Sorting

As we have seen before, sorting our data allows us to view it in new ways. This can be useful in particular for sorting dates

In [None]:
# The function for sorting dataframes is .sort_values()
# You will need to specify which column you are sorting on by adding the 'by' parameter

df.sort_values(by='country').head()

In [None]:
# By default pandas will sort in ascending order, if we want it the other way we need to set ascending to False

df.sort_values(by='country',ascending=False).head()

In [None]:
# If you want to sort by more than one column, you add them as a list
# Note, the order is important here

df.sort_values(by=['country','UFO_shape']).head()
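To see why the order of the list matters, the sketch below sorts the same small made-up table twice, swapping the two columns. The column `seconds` and its values are invented for illustration.

```python
import pandas as pd

toy = pd.DataFrame({
    'country': ['us', 'gb', 'us', 'gb'],
    'seconds': [10, 300, 60, 20],
})

# Sort by country first, then by seconds within each country
country_first = toy.sort_values(by=['country', 'seconds'])
print(list(country_first.index))   # [3, 1, 0, 2]

# Sort by seconds first; country is only used to break ties
seconds_first = toy.sort_values(by=['seconds', 'country'])
print(list(seconds_first.index))   # [0, 3, 2, 1]
```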

# Changing the Data

We have looked at several ways of viewing and exploring our data, but what if we want to clean it? In this section we will look at a few ways we can change our data.

In [None]:
# First, let's look at adding a new column. 
# To do this you use the syntax df['new_column_name']=
# You could make the new column a single value for all rows
# You could also use a list or series, although it must be the same length as the dataframe itself 

df['Number']=1

df.columns

In [None]:
# To remove a column, you use .drop()
# This function works for removing both rows and columns, and is set to rows by default
# To remove a column you need to specify you are looking at axis 1 (columns)

df.drop('Number',axis=1).head()

Before we move on to other functions we need to discuss a specific property of pandas. When you apply a function to a dataframe you are (with a few exceptions) working on a <b>copy</b>, not the original dataframe itself. This means the outputs you are seeing are only being applied in that one instance. 

For example, in the last piece of code we removed the new column we created, but if you look at the list of columns below, you will see it is still there...

In [None]:
df.columns

The .drop() function only removed the column from the copy, not the original. To apply the function to the original, we need to add the parameter inplace=True.<br><br> Alternatively, you could just redefine the dataframe as this new edit.<br><br> i.e. `df=df.drop('Number',axis=1)`
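The same behaviour can be reproduced on the small owner/pet table. This is only a sketch, with `pets` and `trimmed` as illustrative names, showing that `.drop()` hands back a new dataframe and that reassigning is what makes the change stick.

```python
import pandas as pd

pets = pd.DataFrame({'Owner': ['Postman Pat', 'Jon Arbuckle', 'Homer Simpson'],
                     'Pet': ['Jess', 'Garfield', 'Snowball II']})
pets['Number'] = 1

# .drop() returns a new dataframe; the original is untouched
trimmed = pets.drop('Number', axis=1)
print(list(trimmed.columns))   # ['Owner', 'Pet']
print(list(pets.columns))      # ['Owner', 'Pet', 'Number']

# Reassigning the result replaces the original with the edited copy
pets = pets.drop('Number', axis=1)
print(list(pets.columns))      # ['Owner', 'Pet']
```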

In [None]:
df.drop('Number',axis=1,inplace=True)

Now if we look at our list of column names, we should see 'Number' is gone

In [None]:
df.columns

Let's now look at some more useful functions for cleaning/editing our data

In [None]:
# If we wanted to drop rows, we still use .drop() but don't add in axis=1
# We don't want this change to be permanent, so we'll leave out inplace=True

df.drop(0).head()

In [None]:
# There are a couple of methods for renaming columns
# This is a useful data cleaning step to ensure all column names are logical, concise and informative
# The first method is to simply pass a list the same length as the number of columns, with any edits made there

df.columns=['date_time', 'city', 'state_province', 'country', 'UFO_shape',
       'length_of_encounter_seconds', 'described_duration_of_encounter',
       'description', 'date_documented', 'latitude', 'longitude']

df.columns

This is one of the times that doesn't require inplace=True. Because you are assigning directly to `df.columns`, running this code changes the original dataframe.

In [None]:
# If I only wanted to change specific column names, I can use the .rename() function

df.rename(columns={'UFO_shape':'ufo_shape'}).head()

# We can use this function to rename multiple columns in the dictionary
# However, if we want it to stick we will need to add inplace=True

In [None]:
# You may have noticed that when you filter or sort the data the row indices do not reset. 
# The same thing happens if you delete rows, that specific index will be deleted, but everything else stays the same
# This can create issues later if you want to use .iloc, so it is a good idea to reset the index if it has been fractured

df.reset_index().head()

In [None]:
# Notice, this creates a new column called 'index', if you don't want this to be included you instead do:

df.reset_index(drop=True).head()

# Again, this function requires inplace=True for it to permanently apply
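A fractured index is easiest to see on a small example. The sketch below filters a made-up four-row table and then resets the index; the data and variable names are only for illustration.

```python
import pandas as pd

sightings = pd.DataFrame({
    'country': ['us', 'gb', 'us', 'ca'],
    'city': ['austin', 'leeds', 'reno', 'ottawa'],
})

# Filtering keeps the original row labels, leaving gaps
us_only = sightings[sightings.country == 'us']
print(list(us_only.index))   # [0, 2]

# reset_index(drop=True) renumbers the rows from 0 without adding an 'index' column
tidy = us_only.reset_index(drop=True)
print(list(tidy.index))      # [0, 1]
print(tidy.loc[1, 'city'])   # reno
```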

In [None]:
# If we wanted to change a specific value within a dataframe (i.e. a cell), we can use .loc or .iloc

df.iloc[0,1]='San Marcos'

df.iloc[0,1]

# Note, assigning through .iloc or .loc does not require inplace=True
# Any changes made this way are permanent
df.iloc[0,1]='san marcos' #just to change it back...

# Data Cleaning

We have so far looked at ways we can edit our dataframe in terms of tidying it up, but what about ways to clean up what is within the cells?

In [None]:
# Let's first look at changing datatypes, we would expect encounter lengths to be numeric, but the datatype is object
print(df.dtypes)

# If we wanted to change a data type, we can use .astype() to force the change
df.length_of_encounter_seconds.astype('float')

But there is a problem, some of the values are corrupted. The error shows us which value is the issue. Before we can change the datatype we will need to remove that value

In [None]:
# The function for replacing values is called .replace()
# The syntax is to include what you are replacing, and what it now should be
# In this instance, because we are trying to remove a pesky ` from text values, we use .str to access the series' string methods

df.length_of_encounter_seconds.str.replace('`','')

In [None]:
# We then redefine the column so it now has this new series

df['length_of_encounter_seconds']=df.length_of_encounter_seconds.str.replace('`','')

In [None]:
# Now we can change the datatype

df['length_of_encounter_seconds']=df.length_of_encounter_seconds.astype('float')

df.dtypes
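The whole clean-then-convert sequence can be tried on a tiny series first. The values below are made up to mimic the problem (one entry with a stray backtick), and `regex=False` is added so the backtick is treated as plain text whatever the pandas version.

```python
import pandas as pd

durations = pd.Series(['20', '5`', '300'])

# Converting straight away fails because '5`' is not a valid number
try:
    durations.astype('float')
except ValueError as err:
    print('Conversion failed:', err)

# Remove the stray character first, then convert
cleaned = durations.str.replace('`', '', regex=False).astype('float')
print(cleaned.tolist())   # [20.0, 5.0, 300.0]
```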

In [None]:
# Something that can plague our datasets is missing or null values, which pandas calls NaN (Not a Number)
# We saw earlier some functions which show us how many nulls there are
# .isna() will specifically tell you if a value is a null or not

df.isna().head()

In [None]:
# This however returns for each cell if it is a null or not
# If we want to actually count up how many there are we need to aggregate using .sum()
# This function adds up the values in a column (False=0, True=1)

df.isna().sum()

In [None]:
# But what do we do with nulls? 
# One thing we can do is remove them using .dropna()

print("with nulls")
print(df.shape)
print("without nulls")
print(df.dropna().shape)

# You can see that about 14000 rows have been removed

`.dropna()` is indiscriminate if you leave it as it is, but there are a few parameters that can make it more specific:<br><br>

<ul>
    <li><code>axis</code> is by default set to 0, so it will drop rows with nulls. If you want it to drop columns, set <code>axis=1</code></li>
    <li> It might be that you only want to remove rows/columns that are ALL nulls, so you would set <code>how='all'</code></li>
    <li> You may only want to remove rows if the null values are in specific columns, you would then use <code>subset=[column_name]</code></li>
    <li> Again, this function only works on a copy, for it to be permanent you will need to use <code>inplace=True</code></li>
</ul>
<br>

Note, if a particular column is mostly nulls, it is usually better to drop that column than to use `.dropna()` on the rows.
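The effect of these parameters is easier to see side by side. The sketch below applies them to a small made-up table (the column names and values are invented) and prints the shape of each result.

```python
import pandas as pd
import numpy as np

toy = pd.DataFrame({
    'city': ['austin', None, None, 'leeds'],
    'seconds': [20.0, np.nan, 300.0, np.nan],
    'notes': ['bright', None, 'fast', 'slow'],
})

# Default: drop every row that contains at least one null
print(toy.dropna().shape)                     # (1, 3)

# how='all': only drop rows where every value is null
print(toy.dropna(how='all').shape)            # (3, 3)

# subset: only look for nulls in the listed columns
print(toy.dropna(subset=['seconds']).shape)   # (2, 3)
```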


In [None]:
# Alternatively, instead of dropping nulls, you could replace them. There are three functions for this:
# .fillna() imputes a value you specify (useful if the null can be treated as 0)

df.fillna(0).head()

In [None]:
# .ffill() will replace each null with the last valid (non-null) value above it

df.ffill().head()

In [None]:
# .bfill() will replace each null with the first valid (non-null) value below it

df.bfill().head()

Which method you choose for handling null values is up to you, but make sure the decision is reasonable within the context you are working in.
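To compare the three replacement functions directly, the sketch below runs each one on the same short made-up series with two nulls in the middle.

```python
import pandas as pd
import numpy as np

readings = pd.Series([1.0, np.nan, np.nan, 4.0])

# Replace every null with a fixed value
print(readings.fillna(0).tolist())   # [1.0, 0.0, 0.0, 4.0]

# Carry the last valid value forward
print(readings.ffill().tolist())     # [1.0, 1.0, 1.0, 4.0]

# Pull the next valid value backward
print(readings.bfill().tolist())     # [1.0, 4.0, 4.0, 4.0]
```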

In [None]:
# Duplicate values can also cause us problems, to check if any rows are duplicates we can use:

df.duplicated()

In [None]:
# To get a count of how many rows are duplicates, we add .sum()

df.duplicated().sum()

In [None]:
# It might be though that you want to check for duplicates within specific columns
# We can use the subset parameter again to define which columns to check

df.duplicated(subset=['country']).sum()

In [None]:
# If we were to find duplicates, and wanted to remove them, we use the .drop_duplicates() function

df.drop_duplicates()

In [None]:
# This function by default only removes duplicates if the entire row is duplicated
# It also by default will keep the first instance and drop any subsequent
# Here is an example where we drop any row which contains a country already mentioned

df.drop_duplicates(subset='country')

# (Don't worry, we didn't include inplace=True, so the change isn't permanent)
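The difference between checking whole rows and checking a subset of columns is clearer on a small table. The data below is made up for illustration.

```python
import pandas as pd

toy = pd.DataFrame({
    'country': ['us', 'us', 'gb', 'us'],
    'UFO_shape': ['circle', 'circle', 'light', 'light'],
})

# Only row 1 repeats an entire earlier row
print(toy.duplicated().sum())                             # 1

# Dropping full-row duplicates keeps the first instance of each row
print(list(toy.drop_duplicates().index))                  # [0, 2, 3]

# With subset, only the listed columns are compared
print(list(toy.drop_duplicates(subset='country').index))  # [0, 2]
```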

# Grouping Data

We now know some useful data cleaning tips, but what if we wanted to look at a summary of the data? This is where groupby functions come in

In [None]:
# The syntax for a groupby is to tell it which column to pivot on, and then what aggregate we want
# We can use mean, median, max, min, count, sum and std (standard deviation)

df.groupby('country').count()

In [None]:
# If we wanted to group by more than one column, we add them as a list

df.groupby(['country','UFO_shape']).count()

In [None]:
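# By default the groupby will return all columns it can
# (for aggregates that need a calculation, such as mean, non-numeric columns cannot be used;
# depending on your pandas version they are either left out or cause an error, in which case pass numeric_only=True)
# If you want specific columns, call them like you saw earlier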

df.groupby(['country','UFO_shape']).count()[['city','state_province']]
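As a smaller example of the same pattern, the sketch below groups a made-up three-row table by country. Selecting the column before the aggregate, as done for `seconds` here, is an alternative that avoids calculating on columns you don't need.

```python
import pandas as pd

toy = pd.DataFrame({
    'country': ['us', 'us', 'gb'],
    'UFO_shape': ['circle', 'light', 'light'],
    'seconds': [10.0, 30.0, 300.0],
})

# count() works on every column, numeric or not
counts = toy.groupby('country').count()
print(counts)

# Pick the numeric column first, then take the mean per country
average = toy.groupby('country')['seconds'].mean()
print(average['us'])   # 20.0
print(average['gb'])   # 300.0

# numeric_only=True tells mean() to skip the text column
print(toy.groupby('country').mean(numeric_only=True))
```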

# Exporting Data

Finally, how do you save your work? 

In [None]:
# Pandas has a function for exporting your data, which is useful if you want to load it into something else like Power BI
# It will save the file in the folder you opened the Jupyter notebook, unless you specify a file path
# DO NOT NAME YOUR CSV FILE AS THE ONE YOU IMPORTED
# This will overwrite your original data, which you should be keeping as a backup

df.to_csv('ufo_sightings.csv',index=False)

# Top tip, include index=False with this command.
# Otherwise, pandas will save the index as a new column
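To see what `index=False` changes without writing any files, the sketch below calls `.to_csv()` with no file name, which returns the CSV text as a string instead of saving it. It uses the small owner/pet table, and `io.StringIO` is only there to read that text back in.

```python
import io
import pandas as pd

pets = pd.DataFrame({'Owner': ['Postman Pat', 'Jon Arbuckle', 'Homer Simpson'],
                     'Pet': ['Jess', 'Garfield', 'Snowball II']})

# With the default settings the index is written as an unnamed first column
with_index = pets.to_csv()
print(with_index.splitlines()[0])      # ,Owner,Pet

# index=False leaves it out
without_index = pets.to_csv(index=False)
print(without_index.splitlines()[0])   # Owner,Pet

# Reading the text back gives the same two columns
reloaded = pd.read_csv(io.StringIO(without_index))
print(list(reloaded.columns))          # ['Owner', 'Pet']
```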

That ends this notebook. In the next one you will be given an opportunity to clean a dataset using the skills you learned here.