*Edited: 2022-09-13*

# BIP Python workshop 2

This week we start using the Pandas library for managing data.

The Pandas library works with tabular data. It lets us load structured datasets, such as CSV files, conveniently, without having to parse the file by hand the way we would with a messier data file.

## Importing Pandas

When we use Pandas, the first thing we need to do is import the library. We only need to do this once, usually at the very top of our notebook.

It is conventional to import Pandas under the alias `pd` rather than the default name `pandas`. You will see this in many textbooks and tutorials.

In [None]:
import pandas as pd

## Reading CSV Files

Now that we have imported the library, we can create data frames. As you saw in the lesson videos, we can create a data frame using the data frame constructor.
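As a reminder of what the constructor looks like, here is a minimal sketch. The column names and values are made up for illustration and are not taken from `dogs.csv`.

```python
import pandas as pd

# Made-up example rows; these values are illustrative only.
data = {
    "name": ["Rex", "Bella", "Milo"],
    "age_years": [3, 5, 2],
}

# Each dictionary key becomes a column; each list holds that column's values.
example_df = pd.DataFrame(data)

print(example_df)
print(example_df.columns.tolist())
```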

However, it is far more common to load data from a file, so that is how we will load our data in this lab.

First we will load the dataset `dogs.csv`.

To display the dataset we just loaded, we write the variable name on its own line at the end of the cell.

In [None]:
df = pd.read_csv('dogs.csv')

df

In [None]:

print(df.shape)
print(f'this table has {df.shape[0]} rows x {df.shape[1]} columns')
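
If you want to experiment with `read_csv` and `shape` without a file on disk, the sketch below reads CSV text held in a string instead. The CSV content and the name `sample_df` are invented for this example.

```python
import io

import pandas as pd

# Hypothetical CSV content standing in for a file on disk.
csv_text = """animal,legs
cat,4
bird,2
spider,8
"""

# read_csv accepts a file-like object as well as a file name.
sample_df = pd.read_csv(io.StringIO(csv_text))

# shape is a (rows, columns) tuple, so it can be unpacked into two names.
n_rows, n_cols = sample_df.shape
print(f"this table has {n_rows} rows x {n_cols} columns")

# head(2) returns only the first two rows.
print(sample_df.head(2))
```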

Python has many data visualisation libraries. One of the easiest to use is Plotly Express. In this module we will just touch on the basics of visualisation in Python.

We will import Pandas and Plotly Express. As before, this is usually done once at the top of our notebook.

**If the import for Plotly gives an error such as "No module named plotly" then refer to the installation instructions.**
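
If you are not sure whether a library is installed, one way to check before importing it is sketched below. The helper name `is_installed` is our own invention for this example and is not part of Pandas or Plotly; it relies on `importlib.util.find_spec` from the standard library.

```python
import importlib.util


def is_installed(module_name):
    """Return True if Python can find a top-level module with this name."""
    return importlib.util.find_spec(module_name) is not None


# Report the status of the two libraries this notebook needs.
for name in ["pandas", "plotly"]:
    status = "installed" if is_installed(name) else "missing"
    print(f"{name}: {status}")
```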

In [None]:
import pandas as pd
import plotly.express as px

To test that Plotly is working, we will draw a scatter plot of the dogs dataset, modelled on the first example in the Plotly documentation.

Run the code below to render a scatter plot.

**If you are using JupyterLab and the plot area is completely blank then refer to the installation instructions.**

**If you are still having issues and can't get Plotly to work at all in JupyterLab, try it in Jupyter Notebook, since that seems to work more reliably.**

In [None]:
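df = pd.read_csv('dogs.csv')
# Check the column we are about to plot
print(df['UK Registrations'])
# Marker size follows registrations, the x axis is logarithmic, and each breed gets its own colour
px.scatter(df, x="UK Registrations", y="Weight (kg)", log_x=True, size="UK Registrations", color="Breed Name")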

Now we know how to create plots in Plotly Express. You can also create 2D and 3D scatter plots, bar charts, and animated 2D scatter plots.

You are not expected to memorise the dozens of plot types and hundreds of function arguments, so it is important to be able to refer to the official Plotly documentation when building a plot.

Keep the documentation open while you build a plot.

In [None]:
px.bar(df, x="UK Registrations", y="Breed Name", color="Breed Name")

In [None]:
px.line(df, x="UK Registrations", y="Breed Name")

In [None]:
px.pie(df, values="UK Registrations", names="Breed Name", title="UK registrations per breed")

In [None]:
px.histogram(df, x="Weight (kg)", color="Breed Name")

Let's take a look at the Beer and Cider dataset!

In [None]:
df = pd.read_csv('Beer and Cider Alcohol Content.csv')

df

In [None]:
px.scatter(df, x="brand", y="style", color="brand")

In [None]:
px.bar(df, x="brand", y="style", color="brand")

In [None]:
px.line(df, x="product", y="units_per_100ml")

In [None]:
df = pd.read_csv("Beer and Cider Alcohol Content.csv")
# Count how many rows (products) each brand has
brand_counts = df['brand'].value_counts()
# Look up the count for every row's brand, so each row carries its brand's total
count = df['brand'].map(brand_counts)
# Store the counts in a new first column
df.insert(0, 'product_per_brand', count)
# px.pie groups rows that share a brand and adds their values together
px.pie(df.head(20), values='product_per_brand', names='brand', title='product_per_brand')


In [None]:
px.pie(df.tail(20), values='product_per_brand', names='brand', title='product_per_brand')
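
A per-brand count can also be built as a small summary table with one row per brand, which avoids repeating the count on every product row. The sketch below uses made-up rows whose column names mirror the beer and cider dataset; `drinks` and `per_brand` are names chosen for this example.

```python
import pandas as pd

# Made-up rows; only the column names mirror the beer and cider dataset.
drinks = pd.DataFrame({
    "brand": ["Alpha", "Alpha", "Beta", "Gamma", "Gamma", "Gamma"],
    "product": ["Lager", "Stout", "Cider", "IPA", "Pale Ale", "Porter"],
})

# groupby + size gives one row per brand with the number of rows it has.
per_brand = (
    drinks.groupby("brand")
    .size()
    .reset_index(name="product_per_brand")
)

print(per_brand)
```

A table shaped like `per_brand` could then be passed to `px.pie` with `values='product_per_brand'` and `names='brand'`.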


**Another way to load data?**
Yes!
You can also read data straight from an online source by passing its URL to `pd.read_csv`.

In [None]:
import pandas as pd

# read_csv accepts a URL directly, so no separate download step is needed
url = "https://raw.github.com/datasets/gdp/master/data/gdp.csv"
df = pd.read_csv(url)
df

**How to filter data by row value?**
By column: select a column by giving its name, for example `df['Year']`.
By row: keep only the rows whose value in a chosen column matches the value you want to see.
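
Both kinds of selection can be sketched on a tiny table. The country names and values below are invented for illustration; only the column names mirror the GDP dataset.

```python
import pandas as pd

# Made-up values; the country names here are fictional.
gdp = pd.DataFrame({
    "Country Name": ["Aland", "Aland", "Borduria", "Borduria"],
    "Year": [2015, 2016, 2015, 2016],
    "Value": [10.0, 12.0, 20.0, 19.0],
})

# Select a column by name: the result is a Series.
years = gdp["Year"]

# Select rows by value: the comparison builds a True/False mask,
# and indexing with the mask keeps only the rows marked True.
mask = gdp["Country Name"] == "Borduria"
borduria = gdp[mask]

print(years.tolist())
print(borduria)
```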


In [None]:
import plotly.express as px
# First, check which columns we have
columns = df.columns.tolist()
print(df.columns)
# Pick one column (here the third) and print its distinct values
print(set(df[columns[2]].tolist()))
# Plot the unfiltered data as a line graph
fig = px.line(df, x='Year', y='Value', color='Country Name', symbol='Country Name')
fig.show()

Now try plotting filtered data.
Let's start by plotting only the GDP information for Arab World.

In [None]:
# Keep only the rows whose Country Name is 'Arab World'
new_df = df[df['Country Name'] == 'Arab World']
# Check the filtered data
print(new_df)
# Visualise the filtered data as a bar chart, a scatter plot and a line graph
fig = px.bar(new_df, x="Year", y="Value", color='Value')
fig.show()
fig = px.scatter(new_df, x="Year", y="Value", color='Value')
fig.show()
px.line(new_df, x='Year', y='Value', color='Country Code', symbol='Country Name')


Now try plotting filtered data again.
This time, let's plot only the global GDP information for 2016.

In [None]:
# Keep only the rows whose Year is 2016
new_df = df[df['Year'] == 2016]
print(new_df)
# print(len(df[~df.duplicated('Country Code')]['Country Code']))
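
Two filters can also be combined in one step. The sketch below is an optional extra using invented values: each condition is wrapped in parentheses and the two are joined with `&`, meaning "and".

```python
import pandas as pd

# Made-up values; the country names here are fictional.
gdp = pd.DataFrame({
    "Country Name": ["Aland", "Aland", "Borduria", "Borduria"],
    "Year": [2015, 2016, 2015, 2016],
    "Value": [10.0, 12.0, 20.0, 19.0],
})

# Keep rows from 2016 whose Value is also above 15.
both = gdp[(gdp["Year"] == 2016) & (gdp["Value"] > 15)]

print(both)
```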


In [None]:
# Now try to visualise this data
fig = px.pie(new_df.head(20), values='Value', names='Country Name', color='Country Name')
fig.show()
fig = px.bar(new_df.head(20), x='Country Name', y='Value', color='Country Name')
fig.show()

**Now try sorting your data**
Start by sorting by GDP value.
Compare the result with the chart above. Can you tell the difference?

In [None]:
# First, keep only the rows whose Year is 2016
new_df = df[df['Year'] == 2016]
print(new_df)
# Now sort the data by GDP value (ascending by default, so the smallest values come first)
new_df = new_df.sort_values(by=['Value'])
print(new_df)
# Compare the two printouts. Have you noticed the difference?
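
The effect of the sort direction is easier to see on a tiny table. The sketch below uses invented values: `sort_values` sorts in ascending order unless `ascending=False` is passed, so `head` returns the smallest rows in one case and the largest in the other.

```python
import pandas as pd

# Made-up values; the country names here are fictional.
gdp_2016 = pd.DataFrame({
    "Country Name": ["Aland", "Borduria", "Carpania", "Doria"],
    "Value": [12.0, 19.0, 3.0, 7.0],
})

# Default order is ascending: smallest values first.
smallest_first = gdp_2016.sort_values(by="Value")

# ascending=False reverses the order: largest values first.
largest_first = gdp_2016.sort_values(by="Value", ascending=False)

print(smallest_first.head(2)["Country Name"].tolist())
print(largest_first.head(2)["Country Name"].tolist())
```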

In [None]:
# Now let's visualise the sorted data. Can you tell the difference now?
fig = px.pie(new_df.head(20), values='Value', names='Country Name', color='Country Name')
fig.show()
fig = px.bar(new_df.head(20), x='Country Name', y='Value', color='Country Name')
fig.show()

**Task** Try to load your own data

**Hint:** There is more than one way to do this, choose which you prefer.

In [None]:
my_file_name = '?'  # replace ? with the name of your file, without the .csv extension
my_df = pd.read_csv(my_file_name + '.csv')

my_df
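
If you do not have a CSV file to hand, one option is to create a small one first and then load it by its path. The sketch below writes a made-up file called `my_data.csv` into a temporary folder; the file name and contents are assumptions for this example.

```python
import tempfile
from pathlib import Path

import pandas as pd

with tempfile.TemporaryDirectory() as tmp_dir:
    # Build the full path to a hypothetical file inside the temporary folder.
    csv_path = Path(tmp_dir) / "my_data.csv"

    # Write made-up CSV content: a header row followed by two data rows.
    csv_path.write_text("item,count\napple,3\npear,5\n")

    # read_csv accepts a path object as well as a plain file name.
    loaded = pd.read_csv(csv_path)

print(loaded)
```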