### Running a regression after merging
One of the most powerful ways to use data sets is by merging them. This notebook goes over the following things:
- preparing data for merging by cleaning it
- concatenating data sets 
- merging data based on a common column

After merging the data, we will also do a simple linear regression.

In [None]:
import pandas as pd

### Concatenation 
Concatenation is the process of combining data sets that all have the same column headers. Think of it as a way to stack thousands of rows of data from separate files into one table.
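
A minimal sketch of what `pd.concat` does, using two made-up frames rather than the charge files (the column names here are illustrative only). By default each frame keeps its own row labels, so labels repeat in the result; passing `ignore_index=True` renumbers the rows from zero.

```python
import pandas as pd

# Two tiny made-up frames with identical column headers.
part_a = pd.DataFrame({"charge_id": [1, 2], "state": ["NY", "CA"]})
part_b = pd.DataFrame({"charge_id": [3, 4], "state": ["TX", "WA"]})

# Default: rows are stacked and each frame's original index is kept.
stacked = pd.concat([part_a, part_b])
print(stacked.index.tolist())  # [0, 1, 0, 1]

# ignore_index=True discards the old labels and numbers the rows 0..n-1.
renumbered = pd.concat([part_a, part_b], ignore_index=True)
print(renumbered.index.tolist())  # [0, 1, 2, 3]
```

Repeated labels are harmless for counting rows, but they can surprise you later if you select rows with `.loc`.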

In [None]:
charges_01 = pd.read_csv('../data/SH Charge Receipts - 01.csv')
charges_02 = pd.read_csv('../data/SH Charge Receipts - 02.csv')
charges_03 = pd.read_csv('../data/SH Charge Receipts - 03.csv')

In [None]:
print(len(charges_01), len(charges_02), len(charges_03))

In [None]:
charges_01.head()

In [None]:
all_charges = pd.concat([charges_01, charges_02, charges_03])
print(len(all_charges))

In [None]:
all_charges.head()

In [None]:
all_charges.dtypes

### Finding values that can facilitate merging
- load the data set you want to merge with your other data set
- modify your original data set to make sure you have a common column to merge on
- merge your data sets!
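
The second step, building a common column, can go subtly wrong when the code column has blanks. The sketch below uses a made-up three-row CSV (not the real charge files) to show one possible safeguard: reading the code as text, so that missing values stay missing instead of turning into the string `'nan'`. Whether the real files contain blanks is an assumption worth checking.

```python
from io import StringIO

import pandas as pd

# Made-up CSV with one blank NAICS code, standing in for a charge file.
raw = "charge_id,R_NAICS_CODE\n1,722511\n2,\n3,541110\n"

# Default parsing turns the column into floats, so the blank becomes NaN
# and astype(str) yields the text 'nan'.
naive = pd.read_csv(StringIO(raw))
naive["naics_sector"] = naive["R_NAICS_CODE"].astype(str).str[:2]
print(naive["naics_sector"].tolist())  # ['72', 'na', '54']

# Reading the code as a string keeps the blank as a missing value.
safe = pd.read_csv(StringIO(raw), dtype={"R_NAICS_CODE": "string"})
safe["naics_sector"] = safe["R_NAICS_CODE"].str[:2]
print(safe["naics_sector"].isna().sum())  # 1
```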

In [None]:
# Read csv of economic data
economic_data = pd.read_csv('../data/bls_sector_metrics.csv')
economic_data.head()

In [None]:
len(economic_data)

In [None]:
economic_data.dtypes

In [None]:
economic_data['naics_sector'] = economic_data['naics_sector'].astype(str)

In [None]:
economic_data.dtypes

In [None]:
all_charges['R_NAICS_CODE'] = all_charges['R_NAICS_CODE'].astype(str)

In [None]:
all_charges.dtypes

In [None]:
# The first two digits of a NAICS code identify the sector
all_charges['naics_sector'] = all_charges['R_NAICS_CODE'].str[:2]

In [None]:
all_charges.head()

In [None]:
all_charges_economic_data = pd.merge(
    all_charges,
    economic_data,
    on='naics_sector',
    how='inner'
)

In [None]:
all_charges_economic_data.head()
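
An inner merge silently drops every row whose key has no match in the other table. One way to see how many rows that affects is a left merge with `indicator=True`, sketched here on made-up frames. The `sector_name` column and all values are hypothetical stand-ins, since the columns of `bls_sector_metrics.csv` are not shown in this notebook.

```python
import pandas as pd

# Toy stand-ins for the notebook's frames; all values are made up.
charges = pd.DataFrame({"R_NAICS_CODE": ["722511", "722513", "541110", "999999"]})
sectors = pd.DataFrame({
    "naics_sector": ["72", "54"],
    "sector_name": ["Sector A", "Sector B"],  # hypothetical column
})

charges["naics_sector"] = charges["R_NAICS_CODE"].str[:2]

# A left merge keeps every charge; the _merge column records whether a
# matching sector row was found.
checked = pd.merge(
    charges,
    sectors,
    on="naics_sector",
    how="left",
    validate="many_to_one",  # raises if a sector code repeats in `sectors`
    indicator=True,
)

unmatched = checked[checked["_merge"] == "left_only"]
print(len(unmatched), "charge(s) would be dropped by an inner merge")
print(unmatched["naics_sector"].unique())
```

Comparing `len(all_charges)` with `len(all_charges_economic_data)` gives the same count for the real data, but not which sector codes failed to match.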

#### Regression analysis 

In a linear regression, we try to predict how one variable (the dependent variable) changes depending on another variable (the independent variable). In this analysis we're looking at the number of sexual harassment claims and will ask questions like: how does the percentage of women in a sector relate to the number of claims filed per 10,000 people?

In the next cells we will:
- plot our data points
- run a regression

First, let's install the libraries that we need:

In [None]:
# !pip3 install scikit-learn statsmodels numpy matplotlib

### Preparing the data for a linear regression

Let's normalize the data 
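
One possible way to normalize, sketched on made-up data: count the charges in each sector, then divide by the sector's employment to get claims per 10,000 people. The `charge_id` and `employment` column names are assumptions for illustration; substitute whatever the merged data actually contains.

```python
import pandas as pd

# Made-up merged data: one row per charge, sector-level values repeated.
# `charge_id` and `employment` are hypothetical column names.
merged = pd.DataFrame({
    "charge_id": [1, 2, 3, 4, 5],
    "naics_sector": ["72", "72", "72", "54", "54"],
    "employment": [200_000, 200_000, 200_000, 100_000, 100_000],
})

# Collapse to one row per sector: count charges, keep the sector's employment.
by_sector = (
    merged.groupby("naics_sector")
    .agg(claims=("charge_id", "count"), employment=("employment", "first"))
    .reset_index()
)

# Express claims as a rate so that large and small sectors are comparable.
by_sector["claims_per_10k"] = by_sector["claims"] / by_sector["employment"] * 10_000
print(by_sector)
```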

Now let's plot the data 
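
A scatter plot is a quick check that a straight line is a reasonable description of the data before fitting one. The sketch below uses made-up numbers and hypothetical column names (`claims_per_10k`, `pct_women`).

```python
import matplotlib.pyplot as plt
import pandas as pd

# Made-up sector-level data; both column names are hypothetical.
by_sector = pd.DataFrame({
    "claims_per_10k": [0.5, 1.0, 1.5, 2.0, 2.5],
    "pct_women": [30.0, 38.0, 41.0, 52.0, 55.0],
})

fig, ax = plt.subplots()
ax.scatter(by_sector["claims_per_10k"], by_sector["pct_women"])
ax.set_xlabel("Claims per 10,000 people")
ax.set_ylabel("Women in sector (%)")
ax.set_title("Claims rate vs. share of women, by sector")
plt.show()
```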

In [None]:
# Relationship between claims per 10,000 people and the percentage of women


#### Preparing the data for the regression analysis
Here we will split our data into independent and dependent variables: the number of claims per 10,000 people (X) is the independent variable, and the percentage of women in that sector (y) is the dependent variable.
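
A minimal sketch of that split on made-up data. The column names are hypothetical, and dropping incomplete rows is one possible choice for handling missing values, not necessarily the right one for this data set.

```python
import pandas as pd

# Made-up sector-level data; both column names are hypothetical.
by_sector = pd.DataFrame({
    "claims_per_10k": [0.5, 1.0, 1.5, 2.0, None],
    "pct_women": [30.0, 38.0, 52.0, 58.0, 45.0],
})

# Drop rows that are missing either variable before fitting.
complete = by_sector.dropna(subset=["claims_per_10k", "pct_women"])

X = complete[["claims_per_10k"]]  # independent variable, kept 2-D as a DataFrame
y = complete["pct_women"]         # dependent variable, a 1-D Series
print(X.shape, y.shape)  # (4, 1) (4,)
```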


In [None]:
# Prepare the data for regression


#### Running a simple regression

Below we use the statsmodels library to run a regression on these numbers.

In [None]:
# Fit the model

# Print the summary



In [None]:

# Create a scatter plot with the regression line

# Add labels for each point




In [None]:
# Print R-squared and coefficients
