# **Exploratory Data Analysis**
Let's take a look at the cleaned-up data file prepared from the data scraped from Glassdoor.

In [1]:
#import appropriate libraries
#!pip install seaborn
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns

In [5]:
#load the .csv saved in the cleanup notebook into a dataframe
file = r"C:\Users\Tineash\Projects\Glassdoor_webscraper\Data\AV_eng_data_cleaned.csv" #copy the file path between the double quotes
df = pd.read_csv(file)
df

Unnamed: 0,Job Title,Salary Minimum,Salary Maximum,Salary Average,Rating,Company Name,Location,Size,Founded,Type of ownership,Industry,Sector,Revenue,Average Hourly Rate,Converted Salary,City,State,Company Age (years),Title Grouping,Seniority Level
0,Audio Visual System Design Engineer,61,100.0,80500.0,,CCS,"Denver, CO",,,,,,,0,80500,Denver,CO,,engineer,na
1,Audio Visual Design Engineer,85,110.0,97500.0,4.3,AV-Worx,"West Palm Beach, FL",1 to 50 Employees,2014.0,Company - Private,Business Consulting,Management & Consulting,$5 to $10 million (USD),0,97500,West Palm Beach,FL,8.0,engineer,na
2,Audio Visual Systems Field Engineer,48,83.0,63005.0,4.3,System Source,"Hunt Valley, MD",51 to 200 Employees,1981.0,Company - Private,Information Technology Support Services,Information Technology,$10 to $25 million (USD),0,63005,Hunt Valley,MD,41.0,engineer,na
3,(NY) Audio/Visual Design Engineer,50,98.0,69978.0,3.8,A-V Services Inc.,"New York, NY",201 to 500 Employees,1960.0,Company - Private,Telecommunications Services,Telecommunications,Unknown / Non-Applicable,0,69978,New York,NY,62.0,engineer,na
4,Audio Visual Engineer,44,103.0,67259.0,3.1,JVN Systems Inc.,"Deer Park, NY",1 to 50 Employees,,Company - Private,,,$5 to $10 million (USD),0,67259,Deer Park,NY,,engineer,na
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
811,Audio Visual Field Engineer,20.00,40.0,30.0,,"Vistacom, Inc",Pennsylvania,,,,,,,1,62400,Pennsylvania,PA,,engineer,na
812,"Pre-Sales Design Engineer, Audio Visual Remote",56,102.0,75493.0,3.6,Johnson Controls,"Roswell, GA",10000+ Employees,1885.0,Company - Public,Machinery Manufacturing,Manufacturing,$10+ billion (USD),0,75493,Roswell,GA,137.0,engineer,na
813,Audio Visual Sales Engineer,70,80.0,75000.0,,Vario,Remote,,,,,,,0,75000,Remote,Remote,,engineer,na
814,Audio Visual Systems Engineer,70,70.0,70000.0,4.2,The Mom Project,"New York, NY",51 to 200 Employees,2016.0,Company - Private,HR Consulting,Human Resources & Staffing,Unknown / Non-Applicable,0,70000,New York,NY,6.0,engineer,na


In [6]:
#Let's remind ourselves what data attributes we have
df.columns

Index(['Job Title', 'Salary Minimum', 'Salary Maximum', 'Salary Average',
       'Rating', 'Company Name', 'Location', 'Size', 'Founded',
       'Type of ownership', 'Industry', 'Sector', 'Revenue',
       'Average Hourly Rate', 'Converted Salary', 'City', 'State',
       'Company Age (years)', 'Title Grouping', 'Seniority Level'],
      dtype='object')

There are a few things I am interested in looking over first. Let's begin with the **continuous values**: company age, converted salary, and rating could all use a quick look. Let's also draw a boxplot comparing the minimum and maximum salaries.

In [None]:
company_age_hist=df['Company Age (years)'].hist()
company_age_hist

In [None]:
company_rating_hist=df['Rating'].hist(range=[0,5])
company_rating_hist

In [None]:
avg_salary_hist = df['Converted Salary'].hist()
avg_salary_hist

In [None]:
rating_boxplot = df.boxplot(column = 'Rating')
rating_boxplot

In [None]:
avg_salary_boxplot = df.boxplot(column = 'Converted Salary')
avg_salary_boxplot


In [None]:
company_age_boxplot = df.boxplot(column = 'Company Age (years)')
company_age_boxplot

In [None]:
min_v_max_salary = df.boxplot(column = ['Salary Minimum', 'Salary Maximum'])
min_v_max_salary

This comparison tells us that the median salary minimum is in the upper $50k range, the median salary maximum is in the mid $90k range, and most postings fall between $50k and $80k for the minimum and between $80k and $110k for the maximum. It would be nice to compare these figures with a median salary across all fields, but that will have to wait for another time.
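To put numbers behind what the boxplot shows, the quartiles can be computed directly. This is a minimal sketch: `sample` holds only the first five rows from the preview above (values in $k), and `salary_box_stats` is a helper name I made up for illustration. In the notebook, the full `df` would be passed instead.

```python
import pandas as pd

# First five rows from the dataframe preview above (values in $k).
# Illustrative only: pass the full `df` in the notebook.
sample = pd.DataFrame({
    "Salary Minimum": [61, 85, 48, 50, 44],
    "Salary Maximum": [100.0, 110.0, 83.0, 98.0, 103.0],
})

def salary_box_stats(frame):
    """Return the quartiles a boxplot draws for the two salary columns."""
    cols = ["Salary Minimum", "Salary Maximum"]
    stats = frame[cols].quantile([0.25, 0.5, 0.75])
    stats.index = ["Q1", "median", "Q3"]
    return stats

box_stats = salary_box_stats(sample)
print(box_stats)
```

One thing to watch: the hourly posting in the preview (row 811) appears to store an hourly rate in these columns, so it may be worth filtering on `Average Hourly Rate == 0` before reading too much into the lower whisker.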

In [None]:
#Let's see if there are correlations between some continuous variables
df[['Company Age (years)', 'Rating', 'Converted Salary']].corr() # get positive/negative correlation between all variables listed

In [None]:
cmap = sns.diverging_palette(220,10,as_cmap=True)
sns.heatmap(df[['Company Age (years)', 'Rating', 'Converted Salary']].corr(),vmax=.3, center=0, cmap=cmap,
            square=True,linewidths=.5,cbar_kws={"shrink":.5})

Above, we see a very slight positive correlation between a company's rating and the yearly salary it offers. The age of the company has a slight negative correlation with yearly salary.
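Because `Rating` and `Company Age (years)` have missing values, each correlation is computed only on the rows where both columns are present. The sketch below, built on the rows shown in the preview above, counts how many rows back each pair and adds a rank-based (Spearman) correlation as a cross-check. The names `pair_counts`, `pearson` and `spearman` are mine, and choosing Spearman as the cross-check is my own assumption rather than part of the original analysis.

```python
import numpy as np
import pandas as pd

cols = ["Company Age (years)", "Rating", "Converted Salary"]

# Rows taken from the dataframe preview above. Illustrative only.
sample = pd.DataFrame({
    "Company Age (years)": [np.nan, 8.0, 41.0, 62.0, np.nan, 137.0, 6.0],
    "Rating": [np.nan, 4.3, 4.3, 3.8, 3.1, 3.6, 4.2],
    "Converted Salary": [80500, 97500, 63005, 69978, 67259, 75493, 70000],
})

pearson = sample[cols].corr()                    # default method, as used above
spearman = sample[cols].corr(method="spearman")  # rank-based, less sensitive to outliers

# Number of rows where both columns of a pair are non-missing
present = sample[cols].notna().astype(int)
pair_counts = present.T.dot(present)
print(pair_counts)
```

If a pair is backed by only a handful of rows, a "slight" correlation like the ones above should probably be treated as noise.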

The bulk of the continuous data has now been analyzed, so let's take a look at the categorical data. Let's once again review our columns.

In [None]:
df.columns

Scanning through the data, it would be useful to see how location data, such as City and State, as well as the size of the company, correlate with salary. In addition, we should compare salaries that were posted hourly to salaries that were posted yearly to see if the hourly ones tend to be lower. We can also compare seniority levels, but I imagine more seniority will clearly result in a higher salary. The type of ownership and industry/sector should be looked at to see which sectors pay more. This could be useful information for someone who wants to specialize in a certain subfield of data analytics and is looking for the greatest return on investment.
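The hourly versus yearly comparison can be sketched with a simple groupby. I am assuming here that `Average Hourly Rate` is a 0/1 flag where 1 marks a posting that was originally hourly (this matches row 811 in the preview, but it is an inference). `salary_by_pay_type` is a hypothetical helper, and `sample` only holds the rows shown in the preview.

```python
import pandas as pd

# Rows taken from the dataframe preview above. Illustrative only.
sample = pd.DataFrame({
    "Average Hourly Rate": [0, 0, 0, 0, 0, 1, 0, 0, 0],
    "Converted Salary": [80500, 97500, 63005, 69978, 67259,
                         62400, 75493, 75000, 70000],
})

def salary_by_pay_type(frame):
    """Summarize converted salary for hourly vs. yearly postings."""
    # Assumption: 1 = posted as an hourly rate, 0 = posted as a yearly salary
    labels = frame["Average Hourly Rate"].map({0: "yearly", 1: "hourly"})
    return frame.groupby(labels)["Converted Salary"].agg(["count", "median", "mean"])

pay_summary = salary_by_pay_type(sample)
print(pay_summary)
```

Keeping the `count` column next to the averages makes it obvious when one group is too small to compare fairly.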

In [None]:
df_cats = df[['Size', 'Type of ownership', 'Industry', 'Sector', 'Revenue', 'City', 'State', 'Seniority Level']]

In [None]:
sns.set(rc={'figure.figsize':(10,8)},font_scale=0.8)
for i in df_cats.columns:
    cat_num = df_cats[i].value_counts() # postings per category, largest first
    print("The total groupings for %s is %d" % (i, len(cat_num)))
    # x and y are passed as vectors, so no data= argument is needed
    graph = sns.barplot(x=cat_num.index, y=cat_num)
    graph.tick_params(axis='x', labelrotation=90) # rotate labels so long category names stay readable
    #insert line to save strings here
    plt.show()
    

Let's walk through each graph and see if there are any stand out insights at a glance:
* **Size** - The largest company bin (10000+ employees) comprises the bulk of the postings, followed by small enterprises (200-500 employees).
* **Ownership** - Private and publicly held companies have the highest postings for data analysts. Surprisingly, government and university jobs don't have very many postings. For universities this is especially surprising, as there are ample opportunities to look at large data sets to determine student success or happiness.
* **Industry** - Human Resources, Insurance, Healthcare and Business consulting top the list for industries with data analyst job postings on Glassdoor. Surprisingly, biotech is at the low end of the list. Maybe this is due to data analytics being lumped in with the duties of a research scientist. One thing to note is the **broad** range of industries needing data analyst services. Even in this small sample set, **37** different represented industries are observed!
* **Revenue** - The revenue ranges given are roughly equivalent to each other in terms of representation. A good portion of the postings did not have an annual revenue, so supplementing this data with a separate data source would be helpful if we wish to analyze revenue further.
* **Location - city** - Large cities hold the most postings for data analysts, with New York topping the list. Surprisingly, remote jobs are the largest posting type! Good for people who have the skills but don't wish to change locations. 
* **Location - state** - Again, the bulk of locations are remote, with New York state holding the most postings after that. I was surprised to see California and Washington not represented more heavily, but perhaps those postings were listed as remote since they have more experience with that work being farmed out. A question to follow up with later. 
* **Seniority level** - Junior and senior positions based on text scraping are equally represented, but I need to go back and tweak the code to strip out I/II/III from the data and bin them into the appropriate seniority levels. As it stands, I won't be able to get very much out of this.
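The Revenue note above says "a good portion" of postings lack a revenue figure. A quick way to quantify that is sketched below. The function name `share_without_revenue` is hypothetical, and treating the `Unknown / Non-Applicable` label as missing is my assumption. The sample values are the Revenue entries from the preview rows.

```python
import numpy as np
import pandas as pd

# Revenue entries from the dataframe preview above. Illustrative only.
revenue = pd.Series([
    np.nan,
    "$5 to $10 million (USD)",
    "$10 to $25 million (USD)",
    "Unknown / Non-Applicable",
    "$5 to $10 million (USD)",
    np.nan,
    "$10+ billion (USD)",
    np.nan,
    "Unknown / Non-Applicable",
], name="Revenue")

def share_without_revenue(col, unknown_label="Unknown / Non-Applicable"):
    """Fraction of postings with no usable revenue bracket."""
    # Assumption: the 'Unknown' label is as uninformative as a blank
    missing = col.isna() | (col == unknown_label)
    return missing.mean()

missing_share = share_without_revenue(revenue)
print(f"{missing_share:.0%} of sampled postings have no revenue bracket")
```

In the notebook this would be called as `share_without_revenue(df['Revenue'])`.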

In [None]:
df.columns

In [None]:
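# average every numeric column per title group (text columns cannot be averaged, so they are left out)
pd.pivot_table(df, index = 'Title Grouping', values = df.select_dtypes('number').columns.tolist())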

In [None]:
df["Job Title"].value_counts().head(30)

In [None]:
pd.set_option('display.max_rows', None)
pd.pivot_table(df, index = ["Job Title", "Seniority Level"], values = 'Converted Salary').sort_values('Converted Salary', ascending = False)

At this time, I need to go back through and re-bin the titles, including Entry-level = junior, Master = senior, etc.
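One possible shape for that re-binning is a small keyword lookup on the job title. This is only a sketch, not the code from the cleanup notebook: the pattern list, the `mid` level for "II", and the name `seniority_from_title` are all my assumptions. Only the Entry-level = junior and Master = senior rules come from the note above.

```python
import re

# Hypothetical keyword rules, checked in order; the first match wins.
# Only "entry-level -> junior" and "master -> senior" come from the notes;
# the numeral mapping (I/II/III) is an assumption.
SENIORITY_PATTERNS = [
    ("senior", re.compile(r"\b(senior|sr|master|iii)\b", re.IGNORECASE)),
    ("mid", re.compile(r"\bii\b", re.IGNORECASE)),
    ("junior", re.compile(r"\b(junior|jr|entry[- ]level|i)\b", re.IGNORECASE)),
]

def seniority_from_title(title):
    """Return 'senior', 'mid', 'junior', or 'na' for a job title string."""
    for level, pattern in SENIORITY_PATTERNS:
        if pattern.search(title):
            return level
    return "na"  # same placeholder the cleaned data already uses

for example in ["Audio Visual Engineer III", "Entry-Level AV Technician",
                "Audio Visual Engineer"]:
    print(example, "->", seniority_from_title(example))
```

Applied to the dataframe, this would look like `df['Seniority Level'] = df['Job Title'].apply(seniority_from_title)`. The word boundaries (`\b`) keep a lone "I" from matching inside "II" or "III".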

Let's take a look at the average salaries by State/City for these roles. I would hypothesize the coastal states/cities would have the larger average salaries, but that will be affected by the types of jobs being posted in each state.

In [None]:
pd.pivot_table(df, index = 'State', values = ['Converted Salary']).sort_values('Converted Salary', ascending = False)

Surprisingly, Georgia tops the list of average salaries, with California close behind. Utah takes the third spot. Michigan, Maryland and Montana are all low on the list. Perhaps this is due to the types of jobs being offered in these states. Let's make a pivot table showing the job titles and how many of those jobs are being offered. 

In [None]:
pd.pivot_table(df, index = ['State', 'Job Title'], values = 'Converted Salary', aggfunc = 'count').sort_values('State', ascending = False)

This gives us some insights. Utah has only two postings, both at senior pay levels. Georgia is hiring a large number of cybersecurity data analysts. Meanwhile, Michigan, Montana and Maryland are largely hiring junior positions. This suggests that senior-level and cybersecurity analysts potentially earn more than general analysts.

Let's go ahead and loop through all the data in a pivot table to see if anything stands out.

In [None]:
df.columns

In [None]:
df_pivots=df[['Job Title','Salary Minimum','Salary Maximum','Salary Average','Rating','Company Name','Location','Size','Founded','Type of ownership','Industry','Sector','Revenue','Average Hourly Rate', 'City', 'State','Company Age (years)','Title Grouping','Seniority Level','Converted Salary']]
pd.set_option('display.max_rows', None)
for i in df_pivots.columns:
    print(i) # get the column name
    if i != 'Converted Salary': # skip pivoting the salary column against itself
        table = pd.pivot_table(df_pivots, index = i, values='Converted Salary').sort_values('Converted Salary', ascending = False)
        print(table)

A ton of quick insights come out of this pivot table dump. Unsurprisingly, senior analysts make significantly more than juniors ($102k vs $69k average). Business/Cybersecurity analysts top the list of salary ranges. Smaller companies (defined as 1-200, 500-5000 employees) pay on the lower end of salaries, with 200-500 employee companies being the outlier at the higher end. It would be interesting to see if 200-500 employee companies are requesting more senior/cybersecurity jobs, which would skew the average up. Public companies and government jobs pay the highest on average, with university/non-profit being the lowest. Security and HR are on the higher end as well, with biotech, grocery and universities at the bottom. Lastly, the Real Estate, IT and Government sectors are at the high end of salaries.
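The question about whether 200-500 employee companies post more senior roles could be checked with a cross-tabulation of size against seniority. The sketch below uses a made-up toy frame (the rows are NOT from the scraped data, since every row in the preview has seniority `na`), and `seniority_mix_by_size` is a hypothetical helper name.

```python
import pandas as pd

# Toy rows invented purely to show the shape of the output.
toy = pd.DataFrame({
    "Size": ["201 to 500 Employees", "201 to 500 Employees",
             "1 to 50 Employees", "1 to 50 Employees", "10000+ Employees"],
    "Seniority Level": ["senior", "na", "junior", "na", "senior"],
})

def seniority_mix_by_size(frame):
    """Share of each seniority level within every company-size bin."""
    # normalize="index" turns each row of counts into fractions that sum to 1
    return pd.crosstab(frame["Size"], frame["Seniority Level"], normalize="index")

mix = seniority_mix_by_size(toy)
print(mix)
```

If the `senior` share for the 201 to 500 bin in the real data turns out to be well above the other bins, that would support the idea that job mix, rather than company size itself, is driving the higher average.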

The data bins could use some work, specifically the title grouping. It would also be nice to have a larger dataset to reduce skew (such as a state with a single high-paying posting skewing the results). I could clean the data further by removing states with a single job posting from the list. For now, this will work as a launchpad to make some quick ML models for correlation predictions.
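Dropping single-posting states before averaging could look roughly like this. It is a sketch under my own assumptions: `mean_salary_by_state` and its `min_postings` threshold are hypothetical, and `sample` contains only the rows from the preview at the top of the notebook.

```python
import pandas as pd

# State and salary values from the dataframe preview above. Illustrative only.
sample = pd.DataFrame({
    "State": ["CO", "FL", "MD", "NY", "NY", "PA", "GA", "Remote", "NY"],
    "Converted Salary": [80500, 97500, 63005, 69978, 67259,
                         62400, 75493, 75000, 70000],
})

def mean_salary_by_state(frame, min_postings=2):
    """Average salary per state, keeping only states with enough postings."""
    grouped = frame.groupby("State")["Converted Salary"].agg(["count", "mean"])
    kept = grouped[grouped["count"] >= min_postings]
    return kept.sort_values("mean", ascending=False)

state_means = mean_salary_by_state(sample)
print(state_means)
```

With only the preview rows, New York is the only state that survives the filter. On the full `df`, the threshold can be raised or lowered through `min_postings` to see how sensitive the state ranking is.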