# Project Overview

The Trump Administration, in association with the Department of Homeland Security, has proposed changes to some of the requirements for the H1B visa application, one of the most common ways for foreign nationals to work at American companies. In this project, I explore the salaries of Software Engineers in the United States and examine how the proposal would affect non-immigrant Software Engineers. This effect may also show how tech companies, big and small, would be affected by the rule.

# Problem Description
Under the proposed changes to the H1B visa application requirements, entry-level candidates must earn a salary at or above the 45th percentile of their profession's salaries. Similarly, high-skilled candidates must earn a salary at or above the 95th percentile. These thresholds are up from the existing 17th and 67th percentiles for entry-level and high-skilled candidates, respectively. The change is likely to curb the number of visa applicants and create negative sentiment among skilled foreign workers, who have been major contributors to science and technology at American companies. This, in turn, could affect American tech companies, which might suffer a brain drain if the rule is implemented.

# Subject Matter Expertise

In bulleted format, describe the subject matters that will help you explore your topic. Example:
 
1. Data Analysis: I am going to use two data sources to explore recent salaries of Software Engineers in the United States. Then I am going to compare and contrast which groups of Software Engineers would be affected by the rule change relative to the current rule. In addition, I am going to explore the 2019 H1B dataset provided by the US government to see which companies hire the most candidates on H1B visas, compare their average salaries, and see how they would be affected.

# TODO
2. Data Visualization
3. Statistics and Probability
4. Hypothesis Testing
5. Linear Regression
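
The employer comparison described under Data Analysis could be sketched roughly as follows. This is only an illustration: the column names `employer` and `salary` and every row are made up, and the real H1B file may be laid out differently.

```python
import pandas as pd

# Hypothetical columns and made-up rows; the real H1B file may use different names.
h1b = pd.DataFrame({
    "employer": ["A Corp", "A Corp", "B Inc", "A Corp", "C LLC", "B Inc"],
    "salary": [95000.0, 105000.0, 120000.0, 100000.0, 70000.0, 130000.0],
})

# Count hires and average the salary per employer, largest employers first.
by_employer = (
    h1b.groupby("employer")["salary"]
       .agg(hires="count", avg_salary="mean")
       .sort_values("hires", ascending=False)
)
print(by_employer)
```

Sorting by `hires` puts the employers with the most H1B rows at the top, so their average salaries can be compared against the percentile thresholds.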

# Assumptions
The assumption is that the number of eligible H1B candidates will be reduced by more than half if the new rule is implemented.
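
One way to check this assumption against a salary sample is to compare how many salaries clear the old and the new entry-level floors. The sketch below is illustrative only: the `salaries` values are made up, `share_at_or_above` is a hypothetical helper, and it assumes that visa candidates' salaries are distributed like the sample itself, which the real data may not bear out.

```python
import numpy as np

# Made-up yearly salaries (USD) standing in for a real sample.
salaries = np.array([60_000, 72_000, 85_000, 90_000, 98_000,
                     105_000, 116_000, 125_000, 140_000, 150_000])

def share_at_or_above(values, percentile):
    """Return the fraction of values at or above that percentile of the same sample."""
    floor = np.percentile(values, percentile)
    return float(np.mean(values >= floor))

old_share = share_at_or_above(salaries, 17)  # entry-level floor before the proposal
new_share = share_at_or_above(salaries, 45)  # entry-level floor under the proposal

# Relative drop in the share of salaries that clear the entry-level floor.
relative_drop = 1 - new_share / old_share
print(f"old: {old_share:.0%}, new: {new_share:.0%}, drop: {relative_drop:.0%}")
```

The assumption would hold for a given dataset if `relative_drop` came out above 0.5.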


# Steps to Explore the Topic and Problem

1. Download H1B employers data from USCIS website
2. Web-scrape data from Glassdoor for Software Engineer Salaries in the US
3. Download data from the Stack Overflow Survey 2020, which includes Software Engineer total compensation from all around the world
4. Filter the Stack Overflow data to keep only rows for the United States that state a total compensation
5. Calculate the 17th, 45th, 67th, and 95th percentiles of Software Engineer salaries in the US from the Stack Overflow survey data

6. Repeat the percentile calculation for the Glassdoor dataset
7. Visualize both datasets, perhaps with a bar chart or a histogram, to make the distributions clear
8. Explore the H1B dataset to see which companies employ the most H1B workers in the USA, what their average salaries are, and how they would be affected by the new percentile thresholds
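
The filtering and percentile steps above can be sketched as follows. The frame here is a tiny made-up stand-in that reuses the survey's `Country`, `CompFreq`, and `CompTotal` column names; it is not the real survey data, and `survey`, `usa`, and `levels` are names chosen for this example.

```python
import pandas as pd

# Tiny made-up frame using the survey's column names.
survey = pd.DataFrame({
    "Country": ["United States", "Germany", "United States", "United States",
                "United States", "United States"],
    "CompFreq": ["Yearly", "Monthly", "Yearly", "Yearly", "Yearly", "Yearly"],
    "CompTotal": [116000.0, 5000.0, 66000.0, None, 79000.0, 83400.0],
})

# Keep US rows that state a total compensation.
usa = survey[(survey["Country"] == "United States") & survey["CompTotal"].notna()]

# Salary at the existing (17th, 67th) and proposed (45th, 95th) percentiles.
levels = [0.17, 0.45, 0.67, 0.95]
percentiles = usa["CompTotal"].quantile(levels)
print(percentiles)
```

`Series.quantile` interpolates linearly between neighbouring values by default, so the thresholds need not coincide with a salary that actually appears in the data.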

# Data Sources:

In bulleted format, list where you will get data from. Data sources must include one existing data source and one web-scraped source. Example:

1. U.S. Census Estimates of the Total Resident Population and Resident Population Age 18 Years and Older for the United States, States, and Puerto Rico `https://www2.census.gov/programs-surveys/popest/datasets/2010-2019/state/detail/SCPRC-EST2019-18+POP-RES.csv`
2. Kaggle Elon Musk Tweets `https://www.kaggle.com/kingburrito666/elon-musk-tweets`

# Data Exploration
Describe the data using what you know. For larger datasets, you may have to pull out the columns that are of interest to you.

In [16]:
import pandas as pd
import numpy as np

so = pd.read_csv("stackoverflow_survey_results_public.csv")
so


Unnamed: 0,Respondent,MainBranch,Hobbyist,Age,Age1stCode,CompFreq,CompTotal,ConvertedComp,Country,CurrencyDesc,...,SurveyEase,SurveyLength,Trans,UndergradMajor,WebframeDesireNextYear,WebframeWorkedWith,WelcomeChange,WorkWeekHrs,YearsCode,YearsCodePro
0,1,I am a developer by profession,Yes,,13,Monthly,,,Germany,European Euro,...,Neither easy nor difficult,Appropriate in length,No,"Computer science, computer engineering, or sof...",ASP.NET Core,ASP.NET;ASP.NET Core,Just as welcome now as I felt last year,50.0,36,27
1,2,I am a developer by profession,No,,19,,,,United Kingdom,Pound sterling,...,,,,"Computer science, computer engineering, or sof...",,,Somewhat more welcome now than last year,,7,4
2,3,I code primarily as a hobby,Yes,,15,,,,Russian Federation,,...,Neither easy nor difficult,Appropriate in length,,,,,Somewhat more welcome now than last year,,4,
3,4,I am a developer by profession,Yes,25.0,18,,,,Albania,Albanian lek,...,,,No,"Computer science, computer engineering, or sof...",,,Somewhat less welcome now than last year,40.0,7,4
4,5,"I used to be a developer by profession, but no...",Yes,31.0,16,,,,United States,,...,Easy,Too short,No,"Computer science, computer engineering, or sof...",Django;Ruby on Rails,Ruby on Rails,Just as welcome now as I felt last year,,15,8
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
64456,64858,,Yes,,16,,,,United States,,...,,,,"Computer science, computer engineering, or sof...",,,,,10,Less than 1 year
64457,64867,,Yes,,,,,,Morocco,,...,,,,,,,,,,
64458,64898,,Yes,,,,,,Viet Nam,,...,,,,,,,,,,
64459,64925,,Yes,,,,,,Poland,,...,,,,,Angular;Angular.js;React.js,,,,,


In [22]:
so_filtered = so.filter(items= ["CompFreq", "CompTotal", "ConvertedComp", "Country"])
so_filtered

Unnamed: 0,CompFreq,CompTotal,ConvertedComp,Country
0,Monthly,,,Germany
1,,,,United Kingdom
2,,,,Russian Federation
3,,,,Albania
4,,,,United States
...,...,...,...,...
64456,,,,United States
64457,,,,Morocco
64458,,,,Viet Nam
64459,,,,Poland


In [23]:
# Build each mask from the frame being filtered so the boolean index always lines up with it
so_usa = so_filtered[so_filtered.Country == "United States"]
so_usa = so_usa[so_usa.CompTotal.notna()]
so_usa

Unnamed: 0,CompFreq,CompTotal,ConvertedComp,Country
7,Yearly,116000.0,116000.0,United States
13,Yearly,66000.0,66000.0,United States
16,Yearly,79000.0,79000.0,United States
17,Monthly,105000.0,1260000.0,United States
18,Yearly,83400.0,83400.0,United States
...,...,...,...,...
64116,Yearly,150000.0,150000.0,United States
64122,Yearly,70000.0,70000.0,United States
64127,Yearly,140000.0,140000.0,United States
64129,Weekly,3000.0,150000.0,United States


In [21]:
# Monthly and weekly salary entries show discrepancies (e.g. a "Monthly" CompTotal of 105000.0 converts to 1260000.0),
# so I drop them. They make up only a small portion of the dataset, a small sacrifice for consistency's sake.
so_usa_yearly = so_usa[so_usa.CompFreq == "Yearly"]
so_usa_yearly

# Further data filtering is needed as there are outliers in the data

Unnamed: 0,CompFreq,CompTotal,ConvertedComp,Country
7,Yearly,116000.0,116000.0,United States
13,Yearly,66000.0,66000.0,United States
16,Yearly,79000.0,79000.0,United States
18,Yearly,83400.0,83400.0,United States
40,Yearly,106000.0,106000.0,United States
...,...,...,...,...
64113,Yearly,225000.0,225000.0,United States
64116,Yearly,150000.0,150000.0,United States
64122,Yearly,70000.0,70000.0,United States
64127,Yearly,140000.0,140000.0,United States


# Data Cleaning
Show techniques you use to reduce impact of outliers, drop missing, or null values (if any). 

Must show the total number of null or missing values.

Must describe rows or columns dropped.

What was the strategy for dropping outliers?
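
A minimal sketch of one possible cleaning pass is shown below: count missing values, drop incomplete rows, then apply the common 1.5 × IQR rule to flag outliers. The rows are made up (the 1,260,000 value mimics a mis-entered salary), and the IQR rule is one possible strategy, not necessarily the one this project will settle on.

```python
import pandas as pd

# Made-up rows; the 1,260,000 value mimics a mis-entered salary.
df = pd.DataFrame({
    "CompFreq": ["Yearly", "Yearly", None, "Yearly", "Yearly", "Yearly", "Yearly"],
    "ConvertedComp": [116000.0, 66000.0, 79000.0, None, 83400.0, 1260000.0, 106000.0],
})

# Total number of missing values per column, before dropping anything.
null_counts = df.isna().sum()

# Drop rows with a missing value in either column.
cleaned = df.dropna(subset=["CompFreq", "ConvertedComp"])

# IQR rule: keep values within 1.5 * IQR of the first and third quartiles.
q1, q3 = cleaned["ConvertedComp"].quantile([0.25, 0.75])
iqr = q3 - q1
in_range = cleaned["ConvertedComp"].between(q1 - 1.5 * iqr, q3 + 1.5 * iqr)
no_outliers = cleaned[in_range]

print(null_counts)
print(f"rows dropped as outliers: {len(cleaned) - len(no_outliers)}")
```

Reporting `null_counts` and the row counts before and after each step documents exactly what was dropped and why.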

# Describe the Data Using Descriptive Stats
Must show total number of rows and columns in dataset(s).

Use descriptive stats to tell us about your data. Must include mean, median, and mode where applicable. Also must talk about normality of data.

Remember what `standard deviation`, `mean`, `central tendency`, and `variance` mean for your data
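
The descriptive statistics asked for here could be gathered along these lines. The values below are placeholders that should be treated as made up, and skewness is only a rough hint about normality, not a formal test.

```python
import pandas as pd

# Placeholder yearly salaries (USD); treat them as made up.
salaries = pd.Series([66000.0, 70000.0, 79000.0, 83400.0, 106000.0,
                      116000.0, 140000.0, 150000.0, 150000.0, 225000.0],
                     name="ConvertedComp")

n_rows = salaries.shape[0]
summary = {
    "mean": salaries.mean(),
    "median": salaries.median(),
    "mode": salaries.mode().tolist(),  # a list, because several values can tie
    "std": salaries.std(),             # sample standard deviation (ddof=1)
    "variance": salaries.var(),        # sample variance (ddof=1)
    "skew": salaries.skew(),           # above 0 suggests a longer right tail
}

print(f"rows: {n_rows}")
for name, value in summary.items():
    print(f"{name}: {value}")
```

A mean noticeably above the median, together with positive skew, would suggest that the salary distribution is right-skewed rather than normal.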

# Data Visualization
Describe the plots you will use to visualize your data. Must use histograms, bar charts, and pie charts to describe aspects of the data.
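
As a sketch of what a salary histogram would show, the example below bins made-up salaries into 50,000 USD ranges and prints a text bar per bin. The salary values and bin width are assumptions chosen for illustration; the same `edges` could be handed to a plotting library to draw the actual chart.

```python
import numpy as np

# Placeholder yearly salaries (USD); treat them as made up.
salaries = np.array([66000, 70000, 79000, 83400, 106000,
                     116000, 140000, 150000, 150000, 225000])

# Bin edges every 50,000 USD from 50k to 250k.
edges = np.arange(50_000, 250_001, 50_000)
counts, _ = np.histogram(salaries, bins=edges)

# Print one text bar per bin, using plain Python ints for formatting.
bin_edges = edges.tolist()
for left, right, count in zip(bin_edges[:-1], bin_edges[1:], counts.tolist()):
    print(f"{left:>7,} - {right:>7,}: {'#' * count}")
```

Each bin includes its left edge and excludes its right edge, except the last bin, which includes both.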

# Topic Conclusions
Based on what you found about your topic, communicate it to the audience. How did the data analysis steps you took help you solve the problem or find out more information about the problem? Example:

I wanted to explore if the average actor can live comfortably in California. I downloaded this dataset and web-scraped this dataset. I did some stuff. I found that actors that live outside of California can afford to buy a home. Others can't. Blah blah blah blah

# Future Exploration
List what you wish you could do with more knowledge about the topic. Also use this section to save aspects of your data into a new csv file for use in this future exploration. 
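
Saving a filtered frame for later use might look like the sketch below. The file name `so_usa_yearly.csv`, the temporary directory, and the rows are all assumptions made for this example.

```python
import os
import tempfile

import pandas as pd

# Made-up rows standing in for the filtered yearly US salaries.
yearly = pd.DataFrame({
    "CompFreq": ["Yearly", "Yearly", "Yearly"],
    "ConvertedComp": [116000.0, 66000.0, 79000.0],
    "Country": ["United States"] * 3,
})

# Hypothetical output location; any writable path would do.
out_path = os.path.join(tempfile.gettempdir(), "so_usa_yearly.csv")

# index=False keeps the row index out of the file, so reloading it
# does not produce an extra "Unnamed: 0" column.
yearly.to_csv(out_path, index=False)

restored = pd.read_csv(out_path)
print(restored)
```

Writing without the index keeps the saved file limited to the data columns, which makes it easier to reload in a later notebook.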