# Project: Investigate a Dataset (FBI Gun Data)

## Table of Contents
<ul>
<li><a href="#intro">Introduction</a></li>
<li><a href="#wrangling">Data Wrangling</a></li>
<li><a href="#eda">Exploratory Data Analysis</a></li>
<li><a href="#conclusions">Conclusions</a></li>
</ul>

<a id='intro'></a>
## Introduction

This dataset contains the number of firearm background checks initiated through the FBI's National Instant Criminal Background Check System (NICS). NICS is used to determine whether a prospective buyer is eligible to buy firearms or explosives. Gun shops call into this system to ensure that each customer does not have a criminal record and is not otherwise ineligible to make a purchase. The data has been supplemented with state-level data from census.gov.

https://www.fbi.gov/services/cjis/nics
https://github.com/BuzzFeedNews/nics-firearm-background-checks/blob/master/README.md




Questions:

What state has the highest total of gun registrations?

What is the overall trend of gun purchases?

What state has the highest guns per capita?




In [None]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
%matplotlib  inline 

<a id='wrangling'></a>
## Data Wrangling

> In this section, I will load in the data, check for cleanliness, and then trim and clean my dataset for analysis. 


### General Properties


The NICS data is found in one sheet of an .xlsx file. It contains the number of firearm checks by month, state, and type.
The U.S. census data is found in a .csv file. It contains several variables at the state level. Most variables just have one data point per state (2016), but a few have data for more than one year.

In [None]:
# Load the NICS background check data and the U.S. census data

df_guns = pd.read_excel('gun-data.xlsx')

df_us_census = pd.read_csv('U.S. Census Data.csv', sep =',')

#### Guns Data

In [None]:
df_guns.head(5) 

In [None]:
df_guns.info()  # this displays a concise summary of the dataframe,
                # including the number of non-null values in each column

In [None]:
df_guns.isnull().sum()  # check for missing value count for each column

In [None]:
df_guns.describe() # check summary statistics 

In [None]:
df_guns.duplicated().sum() # check for duplicate data

#### Census Data

In [None]:
df_us_census.head(5) # columns and rows should be swapped 

In [None]:
df_us_census.info() # 2 columns have nulls, state columns should be floats or ints

In [None]:
df_us_census['Fact'].unique()  # check unique rows 

In [None]:
df_us_census['Fact Note'].unique() # this column can be dropped since the information is not useful for this analysis

In [None]:
print(df_us_census.duplicated().sum())# check for duplicates



### Data Cleaning 

For consistency I'm going to drop states/territories that are not found in both datasets, strip spaces in columns, and convert all columns to lowercase. I'm going to clean and further check the data for missing data, incorrect data types, and duplicates.
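
The "keep only states found in both datasets" step can be sketched on toy data. This is a minimal illustration, not the notebook's actual cleaning code: the frames `guns_demo` and `census_demo`, their values, and the result names are all hypothetical.

```python
import pandas as pd

# Toy stand-ins for the two datasets (made-up values, not the real data).
guns_demo = pd.DataFrame({
    "state": ["Alabama", "Guam", "Texas "],
    "totals": [100, 5, 300],
})
census_demo = pd.DataFrame(
    {"Alabama": ["1,000"], "Texas": ["2,000"]},
    index=["Population estimates"],
)

# Normalize the state names the same way on both sides.
guns_demo["state"] = guns_demo["state"].str.strip().str.lower()
census_states = {name.strip().lower() for name in census_demo.columns}

# States present in the guns data but missing from the census data.
only_in_guns = sorted(set(guns_demo["state"]) - census_states)

# Keep only the rows whose state appears in both datasets.
states_in_both = guns_demo[guns_demo["state"].isin(census_states)]

print(only_in_guns)
print(states_in_both)
```

Computing the set difference first makes it easy to see which states or territories would be dropped before any rows are removed.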

##### FBI Guns Data

Note: Sales estimates are calculated from handgun, long gun, and multiple-gun background checks.
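
One way to picture such an estimate is as a weighted sum of the three check types. The sketch below is an assumption-heavy illustration: the function `estimate_sales`, the column name `multiple`, and the default weights of 1.0 are hypothetical placeholders, since this notebook does not state the actual weights.

```python
import pandas as pd

def estimate_sales(checks, handgun_weight=1.0, long_gun_weight=1.0, multiple_weight=1.0):
    """Weighted sum of handgun, long gun, and multiple-gun checks.

    The weights are placeholders (assumptions), not values taken from the notebook.
    """
    return (
        checks["handgun"].fillna(0) * handgun_weight
        + checks["long_gun"].fillna(0) * long_gun_weight
        + checks["multiple"].fillna(0) * multiple_weight
    )

# Made-up rows; a missing count is treated as zero checks.
demo = pd.DataFrame({"handgun": [10, 20], "long_gun": [5, None], "multiple": [1, 2]})
demo["sales_estimate"] = estimate_sales(demo)
print(demo)
```

Passing the weights as parameters keeps the estimate easy to adjust once the intended multipliers are known.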

In [None]:
df_guns["state"] = df_guns["state"].str.lower()  #made the state column lowercase


In [None]:
#incorrect data types 

#timestamps are represented as strings instead of datetime 
df_guns['month'] = pd.to_datetime(df_guns.month, format= "%Y-%m")

In [None]:
df_guns.info()

In [None]:
# checking state column

df_guns['state'].unique()  # Guam, Mariana Islands, Puerto Rico, Virgin Islands, District of Columbia aren't in the census data

In [None]:
# drop the Guam, Mariana Islands, Puerto Rico, Virgin Islands, and District of Columbia rows
# since they aren't in the census data
not_in_census = ["guam", "mariana islands", "puerto rico", "virgin islands", "district of columbia"]
df_guns = df_guns[~df_guns["state"].isin(not_in_census)]

df_guns['state'].unique()

#### Creating a new DataFrame for Question 1 so that columns can be dropped (not ideal)

In [None]:
df_guns_q1 = df_guns.copy()  # copy, so that dropping columns later does not alter df_guns

https://stackoverflow.com/questions/51070985/find-out-the-percentage-of-missing-values-in-each-column-in-the-given-dataset

In [None]:
# separate into month and year
#df_guns_q1['year'] = df_guns_q1['month'].dt.year

In [None]:
#missing data - I wanted to see the percentage of missing data in the guns dataset for every column
percent_missing = df_guns_q1.isnull().sum() * 100 / len(df_guns_q1)
missing_values = pd.DataFrame({'column_name': df_guns_q1.columns,
                                 'percent_missing': percent_missing})

missing_values
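
The same percentage can also be obtained from the mean of the null mask, which is a slightly shorter way to write it. A minimal sketch on made-up data; the frame `demo` and the 40 percent cutoff are hypothetical and only illustrate how columns might be shortlisted for dropping.

```python
import numpy as np
import pandas as pd

# Made-up frame: column "a" is half missing, column "b" is complete.
demo = pd.DataFrame({"a": [1.0, np.nan, 3.0, np.nan], "b": [1, 2, 3, 4]})

# The mean of a boolean mask is the fraction of True values, i.e. the share of nulls.
percent_missing = demo.isnull().mean() * 100

# Shortlist columns above an (arbitrary, illustrative) threshold.
mostly_missing = percent_missing[percent_missing > 40].index.tolist()

print(percent_missing)
print(mostly_missing)
```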

In [None]:
df_guns_q1 = df_guns_q1.reset_index(drop=True)

In [None]:
# drop columns with a high volume of nulls or that are not needed for the analysis
columns_to_drop = [
    'permit', 'permit_recheck', 'other', 'admin',
    'prepawn_handgun', 'prepawn_long_gun', 'prepawn_other',
    'redemption_other', 'redemption_handgun', 'redemption_long_gun',
    'returned_other', 'rentals_handgun', 'rentals_long_gun',
    'private_sale_handgun', 'private_sale_long_gun', 'private_sale_other',
    'return_to_seller_handgun', 'return_to_seller_long_gun', 'return_to_seller_other',
    'returned_handgun', 'returned_long_gun',
]
df_guns_q1.drop(columns_to_drop, axis=1, inplace=True)

df_guns_q1.info()

In [None]:
# display a histogram of each numeric column; totals, handgun, and long_gun all appear right-skewed

df_guns_q1.hist(figsize=(10,8));
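
The visual impression of right skew can be checked numerically. A minimal sketch on made-up values (the frame `demo` is hypothetical): a positive sample skewness and a mean above the median both point to a long right tail.

```python
import pandas as pd

# Made-up counts with a few large values, mimicking a long right tail.
demo = pd.DataFrame({"totals": [1, 1, 2, 2, 3, 4, 10, 50]})

# Sample skewness; values above zero indicate a right-skewed distribution.
skewness = demo["totals"].skew()

# In a right-skewed distribution the mean is usually pulled above the median.
mean_above_median = demo["totals"].mean() > demo["totals"].median()

print(skewness, mean_above_median)
```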

##### Data Cleaning: Census Data



In [None]:
df_us_census.T  # preview the transpose: states become rows so the data can be joined with the FBI guns data

In [None]:
df_us_census_2 = df_us_census.T
df_us_census_2.columns = df_us_census_2.loc['Fact']
df_us_census_2.drop(['Fact','Fact Note'],inplace=True)

df_us_census_2

In [None]:
# make lowercase and replace commas with _ and remove spaces
df_us_census_2.columns = [str(x).lower().replace(',','_').replace(' ','') for x in df_us_census_2.columns]

df_us_census_2
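
The transpose-and-rename steps can be tried on a small frame first. This is a sketch under assumptions: the frame `raw` and its values are made up, and the two `Fact` labels are reconstructed to match the column names used later in this notebook.

```python
import pandas as pd

# Made-up miniature of the census layout: one row per fact, one column per state.
raw = pd.DataFrame({
    "Fact": [
        "Population estimates, July 1, 2016,  (V2016)",
        "Population estimates base, April 1, 2010,  (V2016)",
    ],
    "Fact Note": [None, None],
    "Alabama": ["1,000", "900"],
    "Alaska": ["200", "150"],
})

# Use the facts as the index, then transpose so each state becomes a row.
tidy = raw.drop(columns="Fact Note").set_index("Fact").T

# Same normalization as above: lowercase, commas to underscores, spaces removed.
tidy.columns = [str(c).lower().replace(",", "_").replace(" ", "") for c in tidy.columns]
tidy.index.name = "state"

print(tidy)
```

Setting `Fact` as the index before transposing avoids the separate step of assigning the header row and then dropping it.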

https://knowledge.udacity.com/questions/428050

In [None]:
df_us_census_2.info() # check index to drop columns

In [None]:
# Drop the columns at positions 3 through 85, since we're only going to look at the 2016 and 2010 population

df_us_census_2.drop(columns=df_us_census_2.columns[3:86], inplace=True)


df_us_census_2.info()


In [None]:
df_us_census_2.head(80)

https://github.com/malaklm/solution/blob/master/US%20Census%20data.ipynb

In [None]:
# remove the thousands separators (commas) so the values can be converted to numbers
df_us_census_2.replace({",": ''}, regex=True,inplace=True)
df_us_census_2.head()

In [None]:
# convert strings to numbers; without downcast='float' (float32) large population counts stay exact
df_us_census_2['populationestimates_july1_2016_(v2016)'] = pd.to_numeric(df_us_census_2['populationestimates_july1_2016_(v2016)'], errors='coerce')
df_us_census_2['populationestimatesbase_april1_2010_(v2016)'] = pd.to_numeric(df_us_census_2['populationestimatesbase_april1_2010_(v2016)'], errors='coerce')

df_us_census_2.dtypes # check if data type conversion worked
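
The comma-stripping and conversion steps can be checked on a toy Series before running them on the real columns. A minimal sketch with made-up values; the Series `population` is hypothetical. The last lines illustrate why a 32-bit float is a risky target type for population counts: it cannot represent every integer above 2**24.

```python
import numpy as np
import pandas as pd

# Made-up strings in the same style as the census values.
population = pd.Series(["1,234,567", "89,000", "not available"], index=["a", "b", "c"])

# Remove thousands separators, then convert; unparseable entries become NaN.
cleaned = pd.to_numeric(population.str.replace(",", "", regex=False), errors="coerce")
print(cleaned)

# float32 has a 24-bit significand, so 2**24 + 1 is rounded to a neighbor.
as_float32 = int(np.float32(2**24 + 1))
as_float64 = int(np.float64(2**24 + 1))
print(as_float32, as_float64)
```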

In [None]:
df_us_census.head(20)

<a id='eda'></a>
## Exploratory Data Analysis


What is the overall trend of gun purchases?

What state has the highest growth in gun registrations?


What state has the highest guns per capita?


### Research Question 1: What is the overall trend of gun purchases?

https://seaborn.pydata.org/examples/timeseries_facets.html

https://stackoverflow.com/questions/65300109/generating-a-line-graph-using-seaborn-or-matplotlib-with-year-as-hue-month-as

In [None]:
sns.set_theme(style="darkgrid")


guns_overtime = df_guns_q1.groupby(['month'])['totals'].sum()


# a single series has no hue, so no palette is needed; the index (month) goes on the x-axis
overtime_fig = sns.lineplot(data=guns_overtime)
overtime_fig.set_title('Total Background Checks per Month')
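
Monthly totals tend to be noisy, so the overall trend may be easier to read after aggregating. The sketch below uses a made-up monthly series (`monthly` is hypothetical, not the real data) to show two common options: yearly sums and a 12-month rolling mean.

```python
import pandas as pd

# Made-up monthly series covering two years, indexed by month start.
months = pd.date_range("2015-01-01", periods=24, freq="MS")
monthly = pd.Series(range(1, 25), index=months, name="totals")

# Option 1: total per calendar year.
yearly = monthly.groupby(monthly.index.year).sum()

# Option 2: 12-month rolling mean; the first 11 values are NaN by construction.
rolling = monthly.rolling(window=12).mean()

print(yearly)
print(rolling.tail())
```

Either result can be passed to `sns.lineplot` in place of the monthly series. Note that a partial first or last year would understate its yearly sum.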

### Research Question 2: What state has the highest volume of gun registrations?

https://pandas.pydata.org/docs/reference/api/pandas.DataFrame.sort_values.html

In [None]:
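# sum the numeric columns per state (the datetime month column is skipped),
# then keep the five states with the highest totals
df_guns_q1_totals = df_guns_q1.groupby('state').sum(numeric_only=True).sort_values(by='totals', ascending=False).head(5)
df_guns_q1_totals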


In [None]:
# state is the index after the groupby, so reset it to make it available as a column
ax = sns.barplot(x="state", y="totals", data=df_guns_q1_totals.reset_index())

### Research Question 3: What state had the highest per capita sales in 2010?

In [None]:
# Get the July 2010 rows (one row per state); copy so later changes don't touch df_guns
guns_2010 = df_guns[df_guns.month == '2010-07'].copy()
guns_2010.head(5)

In [None]:
guns_2010 = guns_2010.set_index('state')  # index by state so the frame can be joined with the census data
guns_2010

In [None]:
guns_2010.info()

In [None]:
df_us_census_2.info()

In [None]:
# join the census and gun data on the state name
# the census index holds capitalized state names, so lowercase it to match guns_2010
df_us_census_2010 = df_us_census_2['populationestimatesbase_april1_2010_(v2016)'].to_frame()
df_us_census_2010.index = df_us_census_2010.index.str.lower()
df_us_census_2010 = df_us_census_2010.join(guns_2010)

In [None]:
df_us_census_2010.info()

In [None]:
# background checks per capita: July 2010 checks divided by the 2010 population base
percapita_2010 = df_us_census_2010['totals'] / df_us_census_2010['populationestimatesbase_april1_2010_(v2016)']
percapita_2010.sort_values(ascending=False)
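
A join like this aligns rows on index labels, so the labels must match exactly, including case. The sketch below uses made-up numbers (`population` and `checks` are hypothetical) to show how a case mismatch silently produces NaN and how lowercasing the index fixes it.

```python
import pandas as pd

# Made-up inputs: capitalized state names on one side, lowercase on the other.
population = pd.Series({"Alabama": 1000.0, "Alaska": 200.0}, name="population")
checks = pd.DataFrame({"totals": [50, 40]}, index=["alabama", "alaska"])

# Labels differ in case, so nothing lines up and totals is all NaN.
mismatched = population.to_frame().join(checks)

# After lowercasing the index, the rows align as intended.
population.index = population.index.str.lower()
joined = population.to_frame().join(checks)

per_capita = (joined["totals"] / joined["population"]).sort_values(ascending=False)
print(mismatched)
print(per_capita)
```

Checking `joined.isna().sum()` right after a join is a cheap way to catch this kind of silent misalignment.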



<a id='conclusions'></a>
## Conclusions
