<img align="left" src="https://ithaka-labs.s3.amazonaws.com/static-files/images/tdm/tdmdocs/CC_BY.png"><br />

Created by [Nathan Kelber](http://nkelber.com) for
Wayne State's Data Science Strategy and Leadership Course

[Creative Commons CC BY License](https://creativecommons.org/licenses/by/4.0/)<br />
For questions/comments/improvements, email nathan.kelber@ithaka.org.<br />
___

**Pandas Data Cleaning Assignment**

The data is available at: https://data.detroitmi.gov/datasets/blight-violations/data in CSV format. The dataset has 482,497 rows, and the file is 225 MB. That is about half the maximum number of rows in Excel (1,048,576 rows). I recommend checking out the data in Excel before attempting to manipulate it in Pandas. The operations you will do for this assignment can all be executed in Excel, but learning to do them in Pandas will unlock a few benefits for your future work:

* Operating on much larger datasets (millions of rows)
* Faster manipulations
* Automating a series of transformations for new batches of data
* Opening up advanced Pandas methods like regular expression matching, melts, and pivots that are difficult or impossible in Excel

**Business Proposition**

This assignment uses public data to discover possible clients for a snow removal business, but the methods could be used to locate clients in any number of fields. For example, the city of Detroit tracks data on [liquor licenses](https://data.detroitmi.gov/datasets/liquor-licenses/data?) which might be used to market liquor or security products. Some other datasets include:

* [Licensed Professionals](https://data.detroitmi.gov/datasets/licensed-professionals)
* [Restaurant Inspections](https://data.detroitmi.gov/datasets/restaurant-inspections)
* [911 Calls for Service for the last 30 days](https://data.detroitmi.gov/datasets/911-calls-for-service-last-30-days-1)
* [City Payments (Open Checkbook)](https://data.detroitmi.gov/datasets/open-checkbook-payments)
* [Residential Demolitions](https://data.detroitmi.gov/datasets/completed-residential-demolitions/data?)

How can you imagine this data might be useful to local businesses?

# Challenge: Import Blight Data from the Detroit Open Data Portal

Download the data from https://data.detroitmi.gov/datasets/blight-violations/data in CSV format. You can see the file structure of this Jupyter environment by using the menu: "File > Open". Then choose "Upload" to upload the file to the environment.

Alternatively, you can use the `requests` library to pull in the data automatically. The direct link to the CSV file is: https://opendata.arcgis.com/datasets/fe2f692918a04c13a6cead436e7eaec9_0.csv

Can you adapt the code in the following cell to use that URL?

In [None]:
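import requests

# Replace this placeholder with the direct link to the CSV file
url = 'https://example.com/data.csv'
r = requests.get(url, allow_redirects=True)
r.raise_for_status()  # Stop here if the download failed

# Write the downloaded bytes to a local file and close it afterwards
with open('Blight_Violations.csv', 'wb') as f:
    f.write(r.content)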

## Opening, Reading, and Writing CSV Files (.csv)
CSV file data can be easily opened, read, and written using the `pandas` library. For large CSV files (over 500 MB), you may wish to use the `csv` library to read in a single row at a time to reduce the memory footprint. Pandas is flexible for working with tabular data, and the process for importing and exporting to CSV is simple.
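
As a minimal sketch of the row-at-a-time approach mentioned above, the example below streams a small in-memory sample with the standard library's `csv.DictReader`. The sample text and the variable names are made up for illustration; they are not taken from the blight dataset.

```python
import csv
import io

# Hypothetical sample standing in for a large CSV file on disk
sample = io.StringIO(
    "ticket_id,city\n"
    "1,Detroit\n"
    "2,DETROIT\n"
    "3,Southfield\n"
)

row_count = 0
cities = set()

# DictReader yields one row at a time, so only the current row is held in memory
for row in csv.DictReader(sample):
    row_count += 1
    cities.add(row["city"])

print(row_count, sorted(cities))
```

With a real file, the `io.StringIO` object would be replaced by something like `open('your_file.csv', newline='')` inside a `with` block.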

Adapt the Pandas CSV read method below to our file name.

In [None]:
# Import pandas 
import pandas as pd

# Create our dataframe
df = pd.read_csv('example_file.csv', low_memory=False)

We can confirm our data has been read in with `.shape`, which gives us the number of rows and columns in our dataframe.

In [None]:
# Use `.shape` to find rows and columns in the DataFrame
df.shape

In [None]:
# Preview the first 10 rows in our dataframe
df.head(10)

In [None]:
# Preview the last 5 rows in our dataframe
df.tail(5)

# Challenge: Drop Irrelevant Columns

We are primarily interested in discovering where snow removal is an issue. There is a lot of data in this dataset that is not relevant to our analysis. First, let's drop all columns except:

* ticket_id
* violator_name
* mailing_address_str_number
* mailing_address_str_name
* city
* state
* zip_code
* violation_date
* violation_address
* violation_description
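
As a rough sketch of the idea, the toy dataframe below shows two equivalent ways to keep a subset of columns: selecting them with `.loc`, or dropping everything else with `.drop`. The values are made up, and `extra_column` is a hypothetical stand-in for the columns we do not need.

```python
import pandas as pd

# Toy dataframe; values are made up and extra_column is a hypothetical unwanted column
toy = pd.DataFrame({
    "ticket_id": [1, 2],
    "city": ["Detroit", "Southfield"],
    "extra_column": [100.0, 250.0],
})

keep = ["ticket_id", "city"]

# Option 1: select the columns to keep (.copy() gives an independent dataframe)
selected = toy.loc[:, keep].copy()

# Option 2: drop every column that is not in the keep list
dropped = toy.drop(columns=[c for c in toy.columns if c not in keep])

print(list(selected.columns))
print(selected.equals(dropped))
```

Listing the columns to keep is usually the shorter option when most of the columns are being discarded, as they are here.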

In [None]:
# Redefining the dataframe to certain columns
df = df.loc[:, ['column_1', 'column_2', 'column_3']]

In [None]:
# List the name of all columns
df.columns

In [None]:
# Printing out the first five rows of the new dataframe
df.head()

# Challenge: Standardize the 'city' Column

Our city column contains many different spellings of the same city names. The next code cell shows how many unique values there are.

In [None]:
# Display the number of unique values in the city column
unique_names = df['city'].unique()
print(len(unique_names))

We can see the scope of the problems by printing out our list.

In [None]:
list(unique_names)
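
Beyond listing the unique values, it can help to see how often each spelling occurs. The sketch below uses `value_counts()` on a made-up series standing in for `df['city']`. It also shows that `unique()` counts a missing value as its own entry, while `nunique()` skips it by default.

```python
import pandas as pd

# Toy series standing in for df['city']; the values are made up
city = pd.Series(["DETROIT", "Detroit", "DETROIT", "det", None, "Southfield"])

# value_counts() tallies each distinct spelling, most frequent first
counts = city.value_counts()
print(counts)

# unique() keeps the missing value as its own entry; nunique() skips it by default
print(len(city.unique()), city.nunique())
```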

For a problem of this magnitude, we could use libraries like FuzzyWuzzy and Levenshtein to do fuzzy text matching to an established list of Michigan cities. We could also use a tool like OpenRefine. That's beyond the scope of our assignment though. Let's just focus on fixing the entries for Detroit which is a challenge in itself with variations including:

* 'Det'
* 'DEt'
* 'detroit'
* 'DETROIT'
* 'Detroit'
* 'det'
* 'det.'
* 'DETRIT'
* 'DETOIT'
* 'Deroit'
* 'Dertoit'

By the way, you can [buy 'Dertroit Beisbolcats' athletic wear](https://www.beisbolcats.com/collections/all). We can create a filter here using regular expressions to clean up many of these problems by matching all strings that contain 'det', regardless of capitalization.
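
To get a feel for what such a filter catches, here is a small sketch on a handful of the variations listed above, plus a non-Detroit city and a missing value added for illustration. Note that `str.contains` matches the pattern anywhere in the string, and that misspellings without the letters 'det' slip through.

```python
import pandas as pd

# A few of the variations listed above, plus a non-Detroit city and a missing value
city = pd.Series(["Det", "det.", "DETRIT", "Deroit", "Dertoit", "Southfield", None])

# case=False ignores capitalization; na=False treats missing values as non-matches
contains_det = city.str.contains("det", case=False, na=False)

# Entries the filter catches
print(city[contains_det].tolist())

# Entries the filter misses, including the 'Deroit' and 'Dertoit' misspellings
print(city[~contains_det].tolist())
```

This is why the filter cleans up many of the problems rather than all of them.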

In [None]:
# Create a filter that matches 'det'
detroit_filter = df['city'].str.contains('det', case=False, na=False)

Now that you have a filter, can you discover how to change the 4714 relevant values in the 'city' column to the string 'Detroit'?

In [None]:
# Change the relevant values in the 'city' column
# To become 'Detroit'

In [None]:
# Confirm the changes to the 'city' column
df.head(15)

# Challenge: Find all rows where 'violation_description' includes 'snow removal'

We want to use the data in these blight violations to target business owners that could use a reliable snow removal service. Most of the violations here are not relevant for that purpose. Let's remove any that do not mention snow removal. Can you modify the next cell for our task?

In [None]:
# Create a filter that will find snow removal violations
snow_filter = df['column_name'].str.contains('string', case=False, na=False)

In [None]:
# Sum up the number of True values for snow_filter
snow_filter.sum()

In [None]:
# Apply the snow_filter to remove all non-relevant violations
df = df[snow_filter]

In [None]:
# Preview our final, cleaned dataset
df.head()
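
The filtering steps above can be sketched on a toy dataframe. The descriptions below are made up for illustration. Passing `na=False` is a cautious choice: it turns missing descriptions into `False`, so the mask stays purely boolean and can be used to index the dataframe.

```python
import pandas as pd

# Toy dataframe; the descriptions are made up for illustration
toy = pd.DataFrame({
    "ticket_id": [1, 2, 3, 4],
    "violation_description": [
        "Failure to remove snow and ice",
        "Excessive weeds",
        None,
        "SNOW REMOVAL violation",
    ],
})

# na=False turns missing descriptions into False so the mask is purely boolean
mask = toy["violation_description"].str.contains("snow", case=False, na=False)

# True counts as 1, so the sum is the number of matching rows
print(mask.sum())

# Keep only the rows where the mask is True
filtered = toy[mask]
print(filtered["ticket_id"].tolist())
```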

# Bonus Challenge: Create a full address column
This challenge is not required for full credit, but can you use the data from these columns:

* violator_name
* mailing_address_str_number
* mailing_address_str_name
* city
* state
* zip_code

to create a new column called 'mailing_address' so that each violator's address could easily be printed onto a snow removal service brochure that would be sent by mail?

# Challenge: Output Your Cleaned Data

After you've made any necessary changes in Pandas, write the dataframe out to a CSV file. (Remember to always back up your data before writing over an existing file.) Update the file name here to describe your final, cleaned dataset.

In [None]:
# Write data to new file
# Keeping the Header but removing the index
df.to_csv('your_file_name.csv', header=True, index=False)
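
To see what the `index` argument changes, the sketch below writes a made-up toy dataframe to in-memory buffers rather than to disk. With `index=True`, an extra unnamed first column holding the row labels appears in the output. With `index=False`, only the data columns are written.

```python
import io
import pandas as pd

# Toy dataframe with made-up values
toy = pd.DataFrame({"ticket_id": [10, 20], "city": ["Detroit", "Detroit"]})

# Write to in-memory buffers instead of files so nothing on disk is touched
with_index = io.StringIO()
toy.to_csv(with_index, header=True, index=True)

without_index = io.StringIO()
toy.to_csv(without_index, header=True, index=False)

print(with_index.getvalue())
print(without_index.getvalue())

# Reading the index-free output back reproduces the original dataframe
round_trip = pd.read_csv(io.StringIO(without_index.getvalue()))
print(round_trip.equals(toy))
```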

# Turning in your Notebook
Download an HTML version of your notebook from the file menu (File > Download as > HTML (.html)). Send the file to Professor Kelber in an email (nkelber@gmail.com) or through Canvas. Make sure you have successfully run all the code cells in the notebook. You do not need to send your CSV output file.