# Python Exercises 5

In this final, OPTIONAL exercises assignment, you will visualize data in several different ways, using Pandas, matplotlib, and a few other Python libraries. You will also do some basic data cleaning, joining, and manipulation.

Let's import some important stuff!

In [None]:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt

## 1. Cloudbusting

Sometimes, the best way to visualize a text isn't with a chart, but with a different type of visualization. Let's try visualizing the frequency of words in the lyrics of your favorite song using a **word cloud**.

First, let's make sure we have everything installed and imported.

In [None]:
conda install wordcloud --yes

In [None]:
from wordcloud import WordCloud, STOPWORDS

Now that we've got that squared away, it's time to get your favorite song lyrics. Create a multi-line string containing the lyrics to your favorite song (or whatever song is stuck in your head right now). Don't worry, you won't be graded on your musical taste!

![never gonna](https://media.giphy.com/media/Ju7l5y9osyymQ/giphy-downsized.gif)

In [None]:
# create a string for your lyrics here

Then, we're going to clean the string and split it into a list of words. Using string methods, do the following:

* remove any punctuation from your lyric string, like commas, dashes, or semicolons
* make sure everything is lowercase

Try printing your results to make sure everything looks good!
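
As a rough sketch of those cleaning steps, here is one possible approach applied to a made-up placeholder string. The names `sample_lyrics`, `remove_punctuation`, `cleaned_lyrics`, and `words` are invented for this illustration, and other string methods (such as chained `.replace()` calls) would work too.

```python
import string

# Placeholder text standing in for real lyrics (made up for this example)
sample_lyrics = """La la la, la-la; la!
Do re mi: fa, so."""

# Build a translation table that deletes every ASCII punctuation character
remove_punctuation = str.maketrans("", "", string.punctuation)

# Strip the punctuation, then lowercase everything
cleaned_lyrics = sample_lyrics.translate(remove_punctuation).lower()

# Split on whitespace (spaces and newlines) to get a list of words
words = cleaned_lyrics.split()
print(words)
```

Note that deleting a dash joins the words on either side of it (`la-la` becomes `lala`). If you would rather keep them separate, replace dashes with spaces before removing the rest of the punctuation.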

In [None]:
# clean your string here

Then, create the wordcloud! You should use `STOPWORDS` (which I imported for you at the beginning of this question) to exclude common English words from your visualization. You can check out the lecture notes on data visualization for more details about how to configure your wordcloud.

Once you've created the wordcloud, use Matplotlib (`plt`) to show the image in your notebook!

In [None]:
# create your word cloud and plot it here

Beautiful!

## 2. A Scatter of Libraries!

Using the ```libraries_cleaned``` dataset, we are going to create a scatterplot from some of the columns. 

This data was taken from the [2014 Public Library Survey data](https://www.imls.gov/research-evaluation/data-collection/public-libraries-survey). Also, the data is fairly messy (as a lot of data can be) so it's been cleaned up a little bit already to make it easier to work with. 

If you follow the link, it will show you the documentation for the data and an explanation of each column. One major thing to note about this data is that each row is an aggregate count of data gathered from across multiple libraries in one system. This means a lot of our data are very large numbers. 

But, we'll see that when we plot it!

### 2.1 Read the dataset into a Pandas Dataframe

Read the data into a dataframe and print the first 15 entries. 

In [None]:
# code here

### 2.2 Clean the Data! 

Our dataset has a lot of columns which we are not going to use. 

Create a new dataset with a new set of columns from the original one. 

Your dataset should be named ```library_data``` and contain the following columns: 

State, Library Name, State Code, Bookmobiles, Librarians, Print Collection, Library Programs, and Public Internet Computers
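
As a small illustration of the general technique, here is how selecting a subset of columns might look on a toy dataframe. The dataframe `libraries`, its values, and the column `Some Unused Column` are made up for this sketch.

```python
import pandas as pd

# Toy stand-in for the full libraries dataframe (values are made up)
libraries = pd.DataFrame({
    "State": ["PA", "PA", "OH"],
    "Library Name": ["Library A", "Library B", "Library C"],
    "Librarians": [12.5, 3.0, 40.0],
    "Some Unused Column": [1, 2, 3],
})

# Indexing with a list of column names returns only those columns;
# .copy() makes the subset independent of the original dataframe
columns_to_keep = ["State", "Library Name", "Librarians"]
subset = libraries[columns_to_keep].copy()
print(subset.head())
```

Column names must match the dataframe exactly, including capitalization and spaces, or pandas will raise a `KeyError`.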

In [None]:
# create your new subset here

### 2.3 Make the plot! (Finally!)

Now that our dataset is all squeaky clean, we can go ahead and plot it. 

Create a scatterplot plotting the number of librarians against the size of a library system's print collection. 

Create labels for your x and y axes and a title for your graph. If you want to experiment with the color and size of your plot points, that would be cool! 
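
For reference, a labeled scatterplot could be sketched roughly like this. The toy dataframe and its numbers are invented, and the marker size, color, and transparency are arbitrary choices to experiment with.

```python
import matplotlib.pyplot as plt
import pandas as pd

# Toy data with the same column names as the exercise (values are made up)
toy = pd.DataFrame({
    "Librarians": [2, 10, 45, 120],
    "Print Collection": [15000, 90000, 400000, 1200000],
})

fig, ax = plt.subplots(figsize=(8, 6))

# s sets the marker size, c the color, alpha the transparency
ax.scatter(toy["Librarians"], toy["Print Collection"], s=40, c="teal", alpha=0.6)

ax.set_xlabel("Librarians")
ax.set_ylabel("Print collection (items)")
ax.set_title("Librarians vs. print collection size")
plt.show()
```

A little transparency (`alpha` below 1) can help when many points overlap, which is likely with a dataset where most systems are small and a few are very large.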

In [None]:
# plot your data here

## 3. Barcharts and Bookmobiles! 

We want to create a barchart of the number of bookmobiles in each state. However, when we print the final barchart we want to have the name of the state be displayed on the x-axis and we currently don't have that information in our data! 

We can fix that by reading in another dataset and merging it with our library data. 

### 3.1 Read in the data!
Read the ```us_state_ansi_fips``` dataset into a dataframe and call it ```state_codes```. You can explore this data if you want. It has three columns: state name, abbreviation, and FIPS code (a number). 

In [None]:
# read state data here

### 3.2 Sum up the States! 

Use ```.groupby``` on the ```State Code``` column to sum up all of your data for each state, and save the result to a new dataset called ```library_counts```.
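
To see what a grouped sum does, here is a minimal sketch on a toy dataframe. The values are made up and `totals` is just an illustrative name.

```python
import pandas as pd

# Toy data: several library systems per state code (values are made up)
toy = pd.DataFrame({
    "State Code": ["PA", "PA", "OH", "OH", "NY"],
    "Library Name": ["A", "B", "C", "D", "E"],
    "Bookmobiles": [1, 0, 2, 1, 3],
    "Librarians": [10.0, 4.5, 20.0, 6.0, 50.0],
})

# Rows that share a State Code collapse into one row of totals.
# numeric_only=True skips text columns such as Library Name.
totals = toy.groupby("State Code").sum(numeric_only=True)
print(totals)
```

After the groupby, the grouping column becomes the index of the result rather than a regular column, which matters when you merge this data with another dataframe.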

In [None]:
# code here!

### 3.3 Merge the data

Merge the ```library_counts``` and ```state_codes``` data together using an inner join.

**Hint:** For the left dataframe, set ```left_index``` to ```True```, because the values we want to join on are in its index. For the right dataframe, join on the ```'st'``` column using ```right_on```.
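
A merge that matches one dataframe's index against another dataframe's column might look like the sketch below. Both toy dataframes are invented, and the `stname` column name is an assumption made for this example; check the real column names in your own data.

```python
import pandas as pd

# Toy per-state totals, indexed by a state code (values are made up)
counts = pd.DataFrame(
    {"Bookmobiles": [3, 5]},
    index=pd.Index([39, 42], name="State Code"),
)

# Toy lookup table: 'st' holds the code, 'stname' is a hypothetical name column
codes = pd.DataFrame({
    "stname": ["Ohio", "Pennsylvania", "Texas"],
    "st": [39, 42, 48],
})

# The inner join keeps only codes present in BOTH dataframes, so Texas is dropped
merged = pd.merge(counts, codes, how="inner", left_index=True, right_on="st")
print(merged)
```

The keys on both sides need the same type to match. If one side stores codes as strings and the other as integers, the inner join will come back empty.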

In [None]:
# now we merge here

### 3.4 Plot it. 

Now that our data is merged and ready to go we can create a plot! 

Make a barchart that displays the number of bookmobiles in each state. Set the x-axis to be the state name and the y-axis to be the number of bookmobiles. 

Be sure to label your axes, title your plot, and make it big enough to read! You can change the size using ```plt.figure(figsize=())``` and setting ```figsize``` to the width and height you want (in that order)!


**hint:** you can use this code: ```plt.xticks(rotation='vertical')``` to rotate your x-axis tick labels so they are vertical and therefore a little easier to read. 
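
Putting the sizing, labeling, and rotation tips together, a bar chart could be sketched like this. The toy data, the `stname` column name, and the figure size are assumptions for illustration only.

```python
import matplotlib.pyplot as plt
import pandas as pd

# Toy merged data (values and the 'stname' column name are made up)
toy = pd.DataFrame({
    "stname": ["Ohio", "Pennsylvania", "Texas"],
    "Bookmobiles": [3, 5, 8],
})

# figsize is (width, height) in inches; create the figure before plotting
fig = plt.figure(figsize=(12, 6))
plt.bar(toy["stname"], toy["Bookmobiles"])

# Rotate the state names so they don't overlap
plt.xticks(rotation="vertical")
plt.xlabel("State")
plt.ylabel("Number of bookmobiles")
plt.title("Bookmobiles by state")
plt.show()
```

Calling `plt.figure()` after plotting would start a new, empty figure, so set the size first.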

In [None]:
# make your plot here

## 4. Trees N'at

To close for today, let's get out in nature. You'll be working with the "City of Pittsburgh Trees" dataset for this question. 

Let's import the dataset from the WPRDC!

While you're waiting for this to import, you should check out the [dataset page](https://data.wprdc.org/dataset/city-trees) and [data dictionary](https://data.wprdc.org/dataset/city-trees/resource/d47d47da-5044-417c-a24d-8366fd7b1a09) for a little more context. There's a ton of information here!

In [None]:
# import and print the first few rows of the data
pgh_trees = pd.read_csv('https://data.wprdc.org/datastore/dump/1515a93c-73e3-4425-9b35-1cd11b2196da', low_memory=False)
pgh_trees.head()

Let's also clean this dataset a little bit by removing any rows that are missing their common name.

In [None]:
# get rid of any tree that does not have a common name
pgh_trees = pgh_trees[pgh_trees['common_name'].notna()].copy()

### 4.1 How Many Trees Are There?

Find how many unique types of trees there are in this dataset. You can use the common or scientific name to make this determination, but we will be plotting the data using the common name.
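
As a quick illustration of counting distinct values, here is a sketch on a toy series. The tree names are placeholders, not values taken from the real dataset.

```python
import pandas as pd

# Toy column of names with one repeat (placeholder values)
names = pd.Series(["Maple: Red", "Maple: Red", "Oak: Pin", "Stump"], name="common_name")

# nunique() counts the distinct values; unique() lists them
distinct_count = names.nunique()
print(distinct_count)
print(names.unique())
```

By default `nunique()` ignores missing values, which is one reason the rows without a common name were removed earlier.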

In [None]:
# your code here

That's way too many to visualize with a chart! 

### 4.2 Tree Trimming

You can see from looking at the common names of the trees that the type of tree is generally formatted like this: `Maple: Norway`, with the general type of tree first, and the specific type after a colon. (Except for stumps. Poor stumps.)

Let's create a new column in the dataset by getting just the first part of the common name. Call this new column `tree_type`.

**hint**: you can use `.str.split(':').str[0]` in pandas to split strings wherever there's a colon and select the first part of the string!
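
To see how that split behaves, including on names with no colon, here is a small sketch on toy data. The names are placeholders, and `general_type` is an illustrative column name rather than the one the exercise asks for.

```python
import pandas as pd

# Toy names: two with a colon, one without (placeholder values)
toy = pd.DataFrame({"common_name": ["Maple: Red", "Oak: Pin", "Stump"]})

# .str.split(':') turns each name into a list of parts;
# .str[0] then picks the first part of each list
toy["general_type"] = toy["common_name"].str.split(":").str[0]
print(toy)
```

A name without a colon splits into a one-item list, so its first part is simply the whole name. That is why stumps pass through unchanged.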

In [None]:
# trim your trees here

### 4.3 Tree Pruning

Great! Now, let's see how many different trees there are in each type. Group your data by your new `tree_type` column and get your `.count()` values for each tree type, and assign it to a new series, `tree_counts`.
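
One detail worth knowing is that `.count()` tallies non-missing values per column, so the result depends on which column you look at. The sketch below uses a toy dataframe with a made-up `height` column to show the difference.

```python
import pandas as pd

# Toy data with some missing heights (values are made up)
toy = pd.DataFrame({
    "tree_type": ["Maple", "Maple", "Oak", "Stump"],
    "height": [30.0, None, 45.0, None],
})

# count() on the whole group gives a DataFrame: non-missing values per column
per_column = toy.groupby("tree_type").count()
print(per_column)

# Selecting one always-filled column first gives a Series of rows per group
per_group = toy.groupby("tree_type")["tree_type"].count()
print(per_group)
```

Here the `height` counts undercount the trees because missing heights are skipped, while counting a column with no missing values gives the true number of rows in each group.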

In [None]:
# code here

Looking good! But there are still 100+ tree types! Let's filter our counts for only those that have more than ten trees. Save that as a new variable, `filtered_tree_counts`.

**hint**: Try creating a boolean mask for the tree counts that are above that threshold.
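
Filtering with a boolean mask can be sketched as follows. The toy counts are invented, and the `threshold` value here is only for illustration.

```python
import pandas as pd

# Toy counts per tree type (values are made up)
counts = pd.Series({"Maple": 250, "Oak": 40, "Ginkgo": 7}, name="trees")

# Comparing a Series to a number gives a Series of True/False values
threshold = 10
mask = counts > threshold
print(mask)

# Indexing with the mask keeps only the entries where it is True
filtered = counts[mask]
print(filtered)
```

The mask and the filter can also be written in one step as `counts[counts > threshold]`.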

In [None]:
# filter here

### 4.4 Plot the Trees!

Now, visualize these tree counts in whatever form seems best to you! Try changing up the colors and look/feel of your chart.

In [None]:
# make your plot here

Sure are a lot of maple trees, huh.