<a href="https://colab.research.google.com/github/jeremymcwilliams/python-dataviz-fti2021/blob/main/FTI2021.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Data Visualizations with python
## FTI 2021

### Goals of this workshop:  

* Load and filter datasets as needed using pandas
* Learn the basic elements of creating plots with pandas and matplotlib
* Repeat!

Note: the exercises here may be strikingly similar to the FTI "Data Visualizations with R" workshop.



---



Welcome to "Data Visualizations with python". We'll use this notebook to go through some examples, and then do some practice problems. 

This notebook is a mix of text cells and code cells. The text is simply descriptive of what we're doing, but the code cells let you write and execute python code. Below is an example of a code cell. To run the cell, click the 'play' icon on its left edge:


In [None]:
x=5
print(x)

Now use the blank code cell below to create a variable 'y' and set it equal to 20, and print it:

In [None]:
#enter your code below




As you examine the code blocks below, you'll want to make sure you run them. The code in this notebook is "procedural", meaning that it reads top to bottom. So a code block later in the notebook may not run properly if a preceding block isn't executed.

Ok! Now on to data visualizations. Before getting started, it's worth noting that python (and other languages) offers more than one way to work with data and create visualizations. If you ultimately want to try some different approaches, Google searches will undoubtedly lead you there. For the purposes of this session, we're going to use the "matplotlib" package.

We're going to take this general approach:

* Load a dataset
* Filter or format the data we want to use
* Create the visualization
* Edit as necessary
* Save the visualization as an image file

Before doing anything too fancy, we're going to first load some python libraries. Python out of the box can do a lot, but it can't do everything. Fortunately we can import libraries of functions created by the open source community to make our coding easier. The syntax to import a library is:

```
import libraryName as abbreviation
```
...where "abbreviation" is typically a very short word or acronym that can be used later in our code to call functions in that library, basically to save a few keystrokes.

In [1]:
#pandas is a data analysis library
import pandas as pd

#pandas uses matplotlib "under the hood" to generate visualizations
import matplotlib.pyplot as plt

# Enable inline plotting
%matplotlib inline

### Average Height by Country

In our GitHub repository, we've made a couple of datasets available that we'll load. The first is "average-height-of-men.csv". This is a dataset from NCD-RisC (http://www.ncdrisc.org/data-downloads.html) that has the average height of men by country for each year from 1896 to 1996. Let's load this data below by creating a variable called "menDataFrame". We can then take a look at the first few rows of the data with the "head" function.

In [None]:
men_url="https://raw.githubusercontent.com/jeremymcwilliams/python-dataviz-fti2021/main/average-height-of-men.csv"

# loads the data into the 'menDataFrame' variable
menDataFrame = pd.read_csv(men_url)

menDataFrame.head()



This looks pretty straightforward, though that last column heading is a bit of a handful. Let's assign a list of new names to the pandas "columns" attribute to rename all four columns:

In [None]:

menDataFrame.columns=['Country', 'Code', 'Year', 'Height']

menDataFrame.head()
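
Assigning to "columns" replaces every name at once, so the list has to match the number of columns. As an alternative, here is a minimal sketch using the pandas "rename" function, which changes only the columns you name. The data frame and its long column heading below are made up for illustration and are not the real dataset.

```python
import pandas as pd

# Hypothetical stand-in data; the column headings are invented for this example
demo = pd.DataFrame({
    "Entity": ["Chile", "Chile"],
    "Code": ["CHL", "CHL"],
    "Year": [1990, 1991],
    "Mean male height (cm)": [170.0, 170.1],
})

# rename() takes a dictionary of {old name: new name} and returns a new data frame.
# Columns that are not mentioned in the dictionary keep their names.
renamed = demo.rename(columns={"Entity": "Country", "Mean male height (cm)": "Height"})

print(list(renamed.columns))
```

The original "demo" data frame is left unchanged, which can be handy if you want to compare before and after.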

Let's say we're interested in seeing any changes over time in the average height of men from the United States. We can use a pandas filtering expression:

In [None]:

usMen=menDataFrame[menDataFrame.Code.eq('USA')]

print(usMen)
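
To see what the filtering expression is doing, it can help to look at the inner part on its own. The sketch below uses a small made-up data frame (the heights are not real values) to show that the inner expression is just a column of True/False values, one per row, and that the == operator gives the same result as "eq".

```python
import pandas as pd

# Made-up rows for illustration only
demo = pd.DataFrame({
    "Country": ["United States", "Kenya", "United States"],
    "Code": ["USA", "KEN", "USA"],
    "Year": [1990, 1990, 1991],
    "Height": [177.0, 169.5, 177.1],
})

# The inner expression produces one True/False value per row
mask = demo.Code.eq("USA")
print(mask.tolist())

# Putting the mask inside the square brackets keeps only the True rows
usRows = demo[mask]

# The == operator is an equivalent way to write the same comparison
sameRows = demo[demo["Code"] == "USA"]
print(usRows.equals(sameRows))
```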



Now that we have a subset of data, let's create a line plot:

In [None]:

# In the plot function, we set x and y to column names, and can add a title
# The plot() function returns a matplotlib "Axes" object (set equal to "ax") 
ax=usMen.plot(x="Year", y="Height", title="Average height of US men over time")

# We can now use the "ax" object to set x- and y-axis labels
ax.set_ylabel("Height, cm")
ax.set_xlabel("My custom label (years)")

# We rely on the matplotlib library (plt) to display and save the plot
# "gcf()" means "get current figure"
fig1 = plt.gcf()
#save as an image file, if desired
fig1.savefig('usMen.png', dpi=100)

#displays the plot below
plt.show()
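
Another way to arrange the same steps is to create the figure and axes first and then tell pandas to draw on them. This is only a sketch with made-up numbers and a hypothetical file name ("demo_line_plot.png"); it is not a required change to the approach above.

```python
import pandas as pd
import matplotlib.pyplot as plt

# Made-up values for illustration only
demo = pd.DataFrame({"Year": [1990, 1991, 1992], "Height": [177.0, 177.1, 177.2]})

# Create the figure and axes up front, so there is no need to call gcf() later
fig, ax = plt.subplots()

# The "ax" argument tells pandas which axes to draw on
demo.plot(x="Year", y="Height", ax=ax, title="Demo line plot")

ax.set_xlabel("Year")
ax.set_ylabel("Height, cm")

# Save using the figure object we already have (hypothetical file name)
fig.savefig("demo_line_plot.png", dpi=100)
```

Holding on to "fig" and "ax" explicitly can make it easier to keep track of which plot you are editing once a notebook has several of them.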




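Now let's compare multiple countries at once. We can put the country names in a list and use the pandas "isin" function in the filter to return the rows for any of them (this does the same job as chaining several comparisons with the "or" operator |):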

In [None]:
# create a list of countries
countries = ['Kenya','Spain', 'Chile']

# create a new data frame in which the Country field has one of the values in the list
menKenSpaChi = menDataFrame[menDataFrame.Country.isin(countries)]

print(menKenSpaChi)
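
As a quick check on what "isin" does, here is a minimal sketch with made-up rows showing that it selects the same rows as writing out each comparison and joining them with the "or" operator |. Note the parentheses around each comparison, which are needed when using |.

```python
import pandas as pd

# Made-up rows for illustration only
demo = pd.DataFrame({
    "Country": ["Kenya", "Spain", "Chile", "Kenya"],
    "Year": [1990, 1990, 1990, 1991],
    "Height": [169.0, 175.0, 170.0, 169.1],
})

countries = ["Kenya", "Spain"]

# One call to isin() checks each row against the whole list
withIsin = demo[demo.Country.isin(countries)]

# The longer equivalent: one comparison per country, joined with |
withOr = demo[(demo.Country == "Kenya") | (demo.Country == "Spain")]

print(withIsin.equals(withOr))
```

With two countries the difference is small, but the list version stays the same length no matter how many countries you add.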



To plot them, we need to adjust/wrangle the data so each row is a year and each country has its own column of heights, with the Year serving as the index. We can use the pandas "pivot" function to reshape the data this way.

In [None]:


PVmenKenSpaChi = menKenSpaChi.pivot(index='Year', columns='Country', values='Height')
print(PVmenKenSpaChi)
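
If the reshaping is hard to picture, this small sketch may help. It uses four made-up rows in the "long" layout (one row per country per year) and pivots them into the "wide" layout (one row per year, one column per country).

```python
import pandas as pd

# Made-up "long" data: one row per country per year
longDemo = pd.DataFrame({
    "Country": ["Chile", "Chile", "Kenya", "Kenya"],
    "Year": [1990, 1991, 1990, 1991],
    "Height": [170.0, 170.2, 169.0, 169.1],
})

# index   -> values that become the row labels
# columns -> values that become the column headings
# values  -> values that fill the table
wideDemo = longDemo.pivot(index="Year", columns="Country", values="Height")

print(wideDemo)
```

When a data frame in this wide layout is plotted, pandas draws one line per column, using the index for the x-axis.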



Notice the difference between how the two data frames are structured. Now we can plot the data:

In [None]:
ax=PVmenKenSpaChi.plot(title="Average Men's height over time: Chile, Kenya, Spain")

ax.set_ylabel("Height(cm)")

#create file:
fig2 = plt.gcf()
fig2.savefig('chikenspaMen.png', dpi=100)

plt.show()


Now it's your turn! See if you can replicate the steps above for the data set of women's average height by country (https://raw.githubusercontent.com/jeremymcwilliams/python-dataviz-fti2021/main/average-height-of-women.csv), and create a plot showing average height over time for five countries. Break it into these steps:


* Load and print the data set
* Create a list/array of countries you want to examine
* Create a new data frame limited to those countries, and print it
* Pivot your data so it's ready to be plotted
* Plot your data
* Save a file of your plot





In [None]:
#Your turn









### Chicken Weights

Now we're going to shift gears and look at a data set from a 1977 study examining the weights of chickens in different groups based upon their diets:

In [None]:
chicken_url="https://forge.scilab.org/index.php/p/rdataset/source/file/master/csv/datasets/chickwts.csv"

chickens=pd.read_csv(chicken_url)


print(chickens)

This is a pretty simple data set, with each observation (feed type and weight) represented by a row. Now let's say we want to create a bar chart displaying the mean values of each feed type. Once again, we'll need to wrangle our data so it shows each feed type along with its mean. For this, we can use the pandas "groupby" function:

In [None]:


# Here we're grouping the feeds together, and indicating we want them listed with the mean of the weight column
feedMeans = chickens.groupby('feed')['weight'].mean()
print(feedMeans)







# Optional: to get several statistics at once, pass a dictionary to agg():
#chickens.groupby('feed').agg({'weight': ['std', 'mean']})
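
Here is a minimal sketch of what "groupby" produces, using a tiny made-up data set (two hypothetical feed names with two weights each) so the means are easy to check by hand. It also shows the "agg" function, which can calculate more than one statistic per group.

```python
import pandas as pd

# Made-up data for illustration only
demo = pd.DataFrame({
    "feed": ["a", "a", "b", "b"],
    "weight": [100, 200, 300, 500],
})

# One mean per feed type; the feed names become the index of the result
means = demo.groupby("feed")["weight"].mean()
print(means)

# agg() with a list of function names gives one column per statistic
stats = demo.groupby("feed")["weight"].agg(["mean", "std"])
print(stats)
```

Feed "a" averages 150 and feed "b" averages 400, which matches what you would calculate by hand from the four rows.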

Now with our data in the right format, we can use plot.bar to generate a bar plot, along with x- and y-axis labels, and a title: 

In [None]:
# feedMeans is the grouped data created in the previous code cell
feedMeans.plot.bar( ylabel='Average Weight(g)', title="Average weight of chickens by feed type", xlabel="Feed Type")
fig3 = plt.gcf()
fig3.savefig('feedweights.png', dpi=100)


plt.show()
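
A small variation that is sometimes useful: sorting the means before plotting so the bars run from shortest to tallest. The sketch below uses made-up feed names and values, and sets the axis labels through the "ax" object as in the earlier line plots, which is an alternative if your pandas version does not accept the xlabel/ylabel arguments.

```python
import pandas as pd
import matplotlib.pyplot as plt

# Made-up means for three hypothetical feed types
demoMeans = pd.Series([150.0, 400.0, 275.0], index=["feedA", "feedB", "feedC"])

# Sort from smallest to largest so the bars are easier to compare
ordered = demoMeans.sort_values()

fig, ax = plt.subplots()
ordered.plot.bar(ax=ax, title="Demo bar chart")

# Axis labels set on the axes object rather than inside plot.bar()
ax.set_xlabel("Feed Type")
ax.set_ylabel("Average Weight (g)")
```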


Now it's your turn. Use the iris data set (https://gist.githubusercontent.com/netj/8836201/raw/6f9306ad21398ea43cba4f7d537619d0e07d5ae3/iris.csv) to create a bar chart displaying the mean Petal length by Species.

Steps:


* Load and print the data set
* Use groupby to restructure your data
* Display and save the bar chart using the restructured data.



In [None]:
#Your Turn:












Further reading: check out the wide variety of plots one can create with pandas: https://pandas.pydata.org/pandas-docs/stable/user_guide/visualization.html