# Project: Investigate a Dataset: US census data on guns

## Table of Contents
<ul>
<li><a href="#intro">Introduction</a></li>
<li><a href="#wrangling">Data Wrangling</a></li>
<li><a href="#eda">Exploratory Data Analysis</a></li>
<li><a href="#conclusions">Conclusions</a></li>
</ul>

<a id='intro'></a>
## Introduction


In [99]:
#Imports 
import pandas as pd
import matplotlib.pyplot as plt
import numpy as np 
%matplotlib inline 

<a id='wrangling'></a>
## Data Wrangling


### Opening Files 

In [100]:
#Load the census CSV file and the gun data Excel workbook
df_census = pd.read_csv('U.S. Census Data.csv')
xl_ncis = pd.read_excel('gun_data.xlsx')

In [101]:
#converting the excel file to csv
xl_ncis.to_csv('gun_data.csv', encoding = 'utf-8', index= False)

#creating a new dataframe for the gun data
df_ncis = pd.read_csv('gun_data.csv')

### Quick snapshots of the dataframes 

In [102]:
#View the first 5 rows of the census file
df_census.head()

Unnamed: 0,Fact,Fact Note,Alabama,Alaska,Arizona,Arkansas,California,Colorado,Connecticut,Delaware,...,South Dakota,Tennessee,Texas,Utah,Vermont,Virginia,Washington,West Virginia,Wisconsin,Wyoming
0,"Population estimates, July 1, 2016, (V2016)",,4863300,741894,6931071,2988248,39250017,5540545,3576452,952065,...,865454.0,6651194.0,27862596,3051217,624594,8411808,7288000,1831102,5778708,585501
1,"Population estimates base, April 1, 2010, (V2...",,4780131,710249,6392301,2916025,37254522,5029324,3574114,897936,...,814195.0,6346298.0,25146100,2763888,625741,8001041,6724545,1853011,5687289,563767
2,"Population, percent change - April 1, 2010 (es...",,1.70%,4.50%,8.40%,2.50%,5.40%,10.20%,0.10%,6.00%,...,0.063,0.048,10.80%,10.40%,-0.20%,5.10%,8.40%,-1.20%,1.60%,3.90%
3,"Population, Census, April 1, 2010",,4779736,710231,6392017,2915918,37253956,5029196,3574097,897934,...,814180.0,6346105.0,25145561,2763885,625741,8001024,6724540,1852994,5686986,563626
4,"Persons under 5 years, percent, July 1, 2016, ...",,6.00%,7.30%,6.30%,6.40%,6.30%,6.10%,5.20%,5.80%,...,0.071,0.061,7.20%,8.30%,4.90%,6.10%,6.20%,5.50%,5.80%,6.50%


In [103]:
df_ncis.head()

Unnamed: 0,month,state,permit,permit_recheck,handgun,long_gun,other,multiple,admin,prepawn_handgun,...,returned_other,rentals_handgun,rentals_long_gun,private_sale_handgun,private_sale_long_gun,private_sale_other,return_to_seller_handgun,return_to_seller_long_gun,return_to_seller_other,totals
0,2017-09,Alabama,16717.0,0.0,5734.0,6320.0,221.0,317,0.0,15.0,...,0.0,0.0,0.0,9.0,16.0,3.0,0.0,0.0,3.0,32019
1,2017-09,Alaska,209.0,2.0,2320.0,2930.0,219.0,160,0.0,5.0,...,0.0,0.0,0.0,17.0,24.0,1.0,0.0,0.0,0.0,6303
2,2017-09,Arizona,5069.0,382.0,11063.0,7946.0,920.0,631,0.0,13.0,...,0.0,0.0,0.0,38.0,12.0,2.0,0.0,0.0,0.0,28394
3,2017-09,Arkansas,2935.0,632.0,4347.0,6063.0,165.0,366,51.0,12.0,...,0.0,0.0,0.0,13.0,23.0,0.0,0.0,2.0,1.0,17747
4,2017-09,California,57839.0,0.0,37165.0,24581.0,2984.0,0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,123506


> Both dataframes have issues, the census data in particular: it appears to contain null values, badly formatted strings and column names, and numeric data stored as text with commas and percent signs. Let's take a closer look.

> Another important detail is that both datasets are organized by state: the states are columns in the census data and values of the `state` column in the gun data. To make the exploratory data analysis easier, we could transpose the census data so that each state becomes a row, and then group the gun data by state.
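
> As a rough illustration of that idea, the sketch below transposes a toy census frame, sums a toy gun frame by state, and joins the two on the state name. All values and the `population_2016` / `population_2010` labels are made up for the example; only the layout mirrors the real files. It also assumes the state names need the same lowercase-and-underscore treatment in both frames before they will match.

```python
import pandas as pd

# Toy census frame in the wide layout: one row per fact, one column per state.
# Values are invented for illustration only.
census = pd.DataFrame({
    "fact": ["population_2016", "population_2010"],
    "alabama": [500, 480],
    "new_hampshire": [130, 125],
})

# Transpose so that each state becomes a row and each fact becomes a column.
census_by_state = census.set_index("fact").T
census_by_state.index.name = "state"

# Toy gun frame in the long layout: one row per month and state.
guns = pd.DataFrame({
    "month": ["2017-09", "2017-08", "2017-09", "2017-08"],
    "state": ["Alabama", "Alabama", "New Hampshire", "New Hampshire"],
    "totals": [100, 200, 10, 20],
})

# Normalize state names to match the census labels, then sum totals per state.
totals_by_state = (
    guns.assign(state=guns["state"].str.lower().str.replace(" ", "_"))
        .groupby("state")["totals"]
        .sum()
)

# Join on the shared state index.
combined = census_by_state.join(totals_by_state)
print(combined)
```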

### Cleanliness Checks
#### Census data 

In [104]:
#check for missing values 
df_census.isnull().sum()

Fact               5
Fact Note         57
Alabama           20
Alaska            20
Arizona           20
Arkansas          20
California        20
Colorado          20
Connecticut       20
Delaware          20
Florida           20
Georgia           20
Hawaii            20
Idaho             20
Illinois          20
Indiana           20
Iowa              20
Kansas            20
Kentucky          20
Louisiana         20
Maine             20
Maryland          20
Massachusetts     20
Michigan          20
Minnesota         20
Mississippi       20
Missouri          20
Montana           20
Nebraska          20
Nevada            20
New Hampshire     20
New Jersey        20
New Mexico        20
New York          20
North Carolina    20
North Dakota      20
Ohio              20
Oklahoma          20
Oregon            20
Pennsylvania      20
Rhode Island      20
South Carolina    20
South Dakota      20
Tennessee         20
Texas             20
Utah              20
Vermont           20
Virginia     

> This dataset has many missing values. The 'Fact Note' column has 57 of them; instead of dropping those rows, we can replace the nulls in this column with the string 'none'. Every state column listed has exactly 20 nulls, which suggests (but does not prove) that the nulls fall on the same 20 rows. If that is the case, those rows should be dropped from the table.
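
> One way to test that assumption is to check whether the rows with a missing state value are missing in every state column at once. The sketch below does this on a toy frame with invented values; on the real data, `toy` would be `df_census` and the state columns would be everything after the first two columns.

```python
import numpy as np
import pandas as pd

# Toy frame shaped like the census file: two real rows, one blank row,
# and one footnote row that has text in 'Fact' but no state values.
toy = pd.DataFrame({
    "Fact": ["Population", "Housing units", np.nan, "NOTE: footnote text"],
    "Fact Note": [np.nan, np.nan, np.nan, np.nan],
    "Alabama": ["100", "40", np.nan, np.nan],
    "Alaska": ["20", "8", np.nan, np.nan],
})

state_cols = toy.columns[2:]

# Rows where every state column is null.
empty_rows = toy[state_cols].isnull().all(axis=1)

# Rows where only some state columns are null (these would need a closer look).
partial_rows = toy[state_cols].isnull().any(axis=1) & ~empty_rows

print("fully empty rows:", empty_rows.sum())
print("partially empty rows:", partial_rows.sum())
```

> If the second count is zero, dropping the fully empty rows removes all missing state values without discarding any real data.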

In [105]:
#Check for duplicate rows
sum(df_census.duplicated())

3

> There are 3 duplicate rows in this dataframe. We will drop them in the data cleaning step.

In [106]:
#Check the datatypes
df_census.dtypes

Fact              object
Fact Note         object
Alabama           object
Alaska            object
Arizona           object
Arkansas          object
California        object
Colorado          object
Connecticut       object
Delaware          object
Florida           object
Georgia           object
Hawaii            object
Idaho             object
Illinois          object
Indiana           object
Iowa              object
Kansas            object
Kentucky          object
Louisiana         object
Maine             object
Maryland          object
Massachusetts     object
Michigan          object
Minnesota         object
Mississippi       object
Missouri          object
Montana           object
Nebraska          object
Nevada            object
New Hampshire     object
New Jersey        object
New Mexico        object
New York          object
North Carolina    object
North Dakota      object
Ohio              object
Oklahoma          object
Oregon            object
Pennsylvania      object



### Data Cleaning 
#### Census Data 

In [107]:
#Reformat column names: all lowercase, with spaces replaced by underscores.
df_census.rename(columns=lambda x: x.lower().replace(' ', '_'), inplace=True)

#Collect the state column names, skipping the first 2 columns ('fact' and 'fact_note').
states = df_census.columns[2:]

In [108]:
#Fill missing values in the 'fact_note' column with the string 'none'.
#Assigning the result back avoids relying on an in-place fill of a selected column.
df_census['fact_note'] = df_census['fact_note'].fillna('none')

In [109]:
#Drop rows with missing values
df_census.dropna(axis=0, inplace=True)

In [111]:
#Drop duplicated rows
df_census.drop_duplicates(inplace=True, ignore_index=True )

In [112]:
#Transpose the dataframe so that rows and columns swap places
df_census = df_census.transpose()

KeyError: '[\'Population estimates, July 1, 2016,  (V2016)\'\n \'Population estimates base, April 1, 2010,  (V2016)\'\n \'Population, percent change - April 1, 2010 (estimates base) to July 1, 2016,  (V2016)\'\n \'Population, Census, April 1, 2010\'\n \'Persons under 5 years, percent, July 1, 2016,  (V2016)\'\n \'Persons under 5 years, percent, April 1, 2010\'\n \'Persons under 18 years, percent, July 1, 2016,  (V2016)\'\n \'Persons under 18 years, percent, April 1, 2010\'\n \'Persons 65 years and over, percent,  July 1, 2016,  (V2016)\'\n \'Persons 65 years and over, percent, April 1, 2010\'\n \'Female persons, percent,  July 1, 2016,  (V2016)\'\n \'Female persons, percent, April 1, 2010\'\n \'White alone, percent, July 1, 2016,  (V2016)\'\n \'Black or African American alone, percent, July 1, 2016,  (V2016)\'\n \'American Indian and Alaska Native alone, percent, July 1, 2016,  (V2016)\'\n \'Asian alone, percent, July 1, 2016,  (V2016)\'\n \'Native Hawaiian and Other Pacific Islander alone, percent, July 1, 2016,  (V2016)\'\n \'Two or More Races, percent, July 1, 2016,  (V2016)\'\n \'Hispanic or Latino, percent, July 1, 2016,  (V2016)\'\n \'White alone, not Hispanic or Latino, percent, July 1, 2016,  (V2016)\'\n \'Veterans, 2011-2015\' \'Foreign born persons, percent, 2011-2015\'\n \'Housing units,  July 1, 2016,  (V2016)\' \'Housing units, April 1, 2010\'\n \'Owner-occupied housing unit rate, 2011-2015\'\n \'Median value of owner-occupied housing units, 2011-2015\'\n \'Median selected monthly owner costs -with a mortgage, 2011-2015\'\n \'Median selected monthly owner costs -without a mortgage, 2011-2015\'\n \'Median gross rent, 2011-2015\' \'Building permits, 2016\'\n \'Households, 2011-2015\' \'Persons per household, 2011-2015\'\n \'Living in same house 1 year ago, percent of persons age 1 year+, 2011-2015\'\n \'Language other than English spoken at home, percent of persons age 5 years+, 2011-2015\'\n \'High school graduate or higher, percent of persons age 25 
years+, 2011-2015\'\n "Bachelor\'s degree or higher, percent of persons age 25 years+, 2011-2015"\n \'With a disability, under age 65 years, percent, 2011-2015\'\n \'Persons  without health insurance, under age 65 years, percent\'\n \'In civilian labor force, total, percent of population age 16 years+, 2011-2015\'\n \'In civilian labor force, female, percent of population age 16 years+, 2011-2015\'\n \'Total accommodation and food services sales, 2012 ($1,000)\'\n \'Total health care and social assistance receipts/revenue, 2012 ($1,000)\'\n \'Total manufacturers shipments, 2012 ($1,000)\'\n \'Total merchant wholesaler sales, 2012 ($1,000)\'\n \'Total retail sales, 2012 ($1,000)\' \'Total retail sales per capita, 2012\'\n \'Mean travel time to work (minutes), workers age 16 years+, 2011-2015\'\n \'Median household income (in 2015 dollars), 2011-2015\'\n \'Per capita income in past 12 months (in 2015 dollars), 2011-2015\'\n \'Persons in poverty, percent\' \'Total employer establishments, 2015\'\n \'Total employment, 2015\' \'Total annual payroll, 2015 ($1,000)\'\n \'Total employment, percent change, 2014-2015\'\n \'Total nonemployer establishments, 2015\' \'All firms, 2012\'\n \'Men-owned firms, 2012\' \'Women-owned firms, 2012\'\n \'Minority-owned firms, 2012\' \'Nonminority-owned firms, 2012\'\n \'Veteran-owned firms, 2012\' \'Nonveteran-owned firms, 2012\'\n \'Population per square mile, 2010\' \'Land area in square miles, 2010\'\n \'FIPS Code\'] not found in axis'
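
> The `KeyError` above lists the fact labels and ends with "not found in axis", which is the message pandas raises when `drop` is asked to remove labels that do not exist on the chosen axis. The call that triggered it is not shown in the cell, so the following is only a guess at the cause: the fact labels are values in the `fact` column, not column names. The sketch below reproduces that situation on a toy frame (population figures taken from the snapshot earlier in this notebook) and shows one possible alternative, which is to move `fact` into the index before transposing.

```python
import pandas as pd

# Toy frame in the cleaned census layout (lowercase column names).
toy = pd.DataFrame({
    "fact": [
        "Population estimates, July 1, 2016, (V2016)",
        "Population, Census, April 1, 2010",
    ],
    "fact_note": ["none", "none"],
    "alabama": ["4863300", "4779736"],
    "alaska": ["741894", "710231"],
})

# The fact labels are cell values, not column names, so dropping them
# along axis=1 raises a KeyError ending in "not found in axis".
try:
    toy.drop(toy["fact"], axis=1)
except KeyError as err:
    message = str(err)
    print("KeyError:", message)

# Possible alternative: drop the note column, use 'fact' as the index,
# then transpose so states are rows and facts are columns.
by_state = toy.drop(columns="fact_note").set_index("fact").T
by_state.index.name = "state"
print(by_state)
```

> With this layout, each fact can be selected by name as a column, and the state index lines up with the `state` values in the gun data once both use the same spelling.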

In [None]:
df_census.head()

<a id='eda'></a>
## Exploratory Data Analysis


### Research Question 1 (Replace this header name!)

In [None]:
# Use this, and more code cells, to explore your data. Don't forget to add
#   Markdown cells to document your observations and findings.


### Research Question 2  (Replace this header name!)

In [None]:
# Continue to explore the data to address your additional research
#   questions. Add more headers as needed if you have more questions to
#   investigate.


<a id='conclusions'></a>
## Conclusions
