<h1> What CS field maximizes your potential for salary growth? </h1>
    by Almamy Bah and Oscar Herrera
    
<h2> Introduction </h2> 

Many undergraduates enter the workforce after graduation without a basic understanding of what their skills can earn. This is especially true of computer science students who are unsure which career path to take. Because compensation is one of the most important factors in choosing a job, these students need trustworthy information about the financial prospects of career paths such as data science, cybersecurity, and software engineering.
Our goal is to apply the data science pipeline to the available salary data so that computer science students can see which options pay best. We will compare average pay across job titles, levels of experience, and companies to show how students can maximize their earning potential. We will also examine how geography influences income by identifying which states offer the highest pay for a given degree. Finally, we will look at factors unrelated to the job itself, such as ethnicity, gender, and educational attainment, that could have a lasting effect on income.
By providing comprehensive and accurate information, we aim to help computer science students make informed decisions about their careers and maximize their earning potential. Ultimately, we believe this will help create a more equitable and prosperous job market for all computer science graduates.


<h1> Data Collection </h1>
For this step, we looked for a dataset on salaries in computer science jobs. We chose the Data Science and STEM Salaries dataset from Kaggle, which contains more than 62,000 salary records from top companies. It includes fields such as gender, race, level of experience, and base salary, which we think is enough for our tutorial. The dataset can be found <a href="https://www.kaggle.com/datasets/jackogozaly/data-science-and-stem-salaries">here</a>.




First, we import the libraries we will use for the project and load the dataset into a pandas DataFrame:

In [14]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt

db = pd.read_csv("final_data.csv")
print(db)

                timestamp     company     level                         title  \
0       6/7/2017 11:33:27      Oracle        L3               Product Manager   
1      6/10/2017 17:11:29        eBay      SE 2             Software Engineer   
2      6/11/2017 14:53:57      Amazon        L7               Product Manager   
3       6/17/2017 0:23:14       Apple        M1  Software Engineering Manager   
4      6/20/2017 10:58:51   Microsoft        60             Software Engineer   
...                   ...         ...       ...                           ...   
62637   9/9/2018 11:52:32      Google        T4             Software Engineer   
62638   9/13/2018 8:23:32   Microsoft        62             Software Engineer   
62639  9/13/2018 14:35:59        MSFT        63             Software Engineer   
62640  9/16/2018 16:10:35  Salesforce  Lead MTS             Software Engineer   
62641   1/29/2019 5:12:59       apple      ict3             Software Engineer   

       totalyearlycompensat
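Because `print(db)` abbreviates wide tables, it can help to list the column names and count the missing values before deciding what to keep. The snippet below is a minimal sketch on a small made-up DataFrame (the name `sample` and all of its values are assumptions for illustration, not rows from the Kaggle file); the same calls should work on `db`.

```python
import pandas as pd

# Toy stand-in for the salary data; every value here is invented for illustration.
sample = pd.DataFrame({
    "company": ["A", "B", "C"],
    "title": ["Product Manager", "Software Engineer", "Product Manager"],
    "totalyearlycompensation": [120000, 100000, 300000],
    "gender": [None, "Male", None],
})

column_names = sample.columns.tolist()       # full list of column names
n_rows, n_cols = sample.shape                # (number of rows, number of columns)
missing_per_column = sample.isnull().sum()   # null count for each column

print(column_names)
print(n_rows, n_cols)
print(missing_per_column)
```

Looking at the null counts early also gives a sense of how many rows will survive later filtering on columns such as gender.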

Next, we drop the columns that contain data we do not need for our project.

In [15]:
# Columns we do not need for this project
columns_to_drop = [
    "yearsatcompany", "tag", "cityid", "otherdetails", "dmaid", "rowNumber",
    "Masters_Degree", "Bachelors_Degree", "Doctorate_Degree", "Highschool", "Some_College",
    "Race_Asian", "Race_White", "Race_Two_Or_More", "Race_Black", "Race_Hispanic",
]

# Drop them all in a single call
db = db.drop(columns=columns_to_drop)

print(db)

                timestamp     company     level                         title  \
0       6/7/2017 11:33:27      Oracle        L3               Product Manager   
1      6/10/2017 17:11:29        eBay      SE 2             Software Engineer   
2      6/11/2017 14:53:57      Amazon        L7               Product Manager   
3       6/17/2017 0:23:14       Apple        M1  Software Engineering Manager   
4      6/20/2017 10:58:51   Microsoft        60             Software Engineer   
...                   ...         ...       ...                           ...   
62637   9/9/2018 11:52:32      Google        T4             Software Engineer   
62638   9/13/2018 8:23:32   Microsoft        62             Software Engineer   
62639  9/13/2018 14:35:59        MSFT        63             Software Engineer   
62640  9/16/2018 16:10:35  Salesforce  Lead MTS             Software Engineer   
62641   1/29/2019 5:12:59       apple      ict3             Software Engineer   

       totalyearlycompensat
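One practical note when working in a notebook: running a drop cell a second time raises a `KeyError`, because the columns are already gone. A possible way around this, sketched below on a made-up DataFrame (`sample`, `unwanted`, and the values are illustrative assumptions), is to pass `errors="ignore"` so that missing column names are skipped.

```python
import pandas as pd

# Toy frame with two columns we want to keep and two we want to drop.
sample = pd.DataFrame({
    "company": ["A", "B"],
    "totalyearlycompensation": [100000, 150000],
    "tag": ["x", "y"],
    "cityid": [1, 2],
})

# "dmaid" is deliberately absent from this toy frame.
unwanted = ["tag", "cityid", "dmaid"]

# errors="ignore" skips names that are not present instead of raising KeyError.
trimmed = sample.drop(columns=unwanted, errors="ignore")

remaining = trimmed.columns.tolist()
print(remaining)
```

Note that `drop` returns a new DataFrame and leaves `sample` unchanged, which is why the result has to be assigned.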

<h1> Data Processing </h1>

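Now we remove rows containing null values for gender and race. We build a separate filtered DataFrame for each column, so a row that is missing only one of the two values is still kept in the other DataFrame.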
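To see how this kind of filtering affects row counts, here is a minimal sketch on a made-up DataFrame. It uses `dropna` with the `subset` parameter, which is one common alternative to a boolean mask; the name `sample` and its values are assumptions for illustration only.

```python
import pandas as pd

# Toy frame: one row missing gender and race, one missing only race, one complete.
sample = pd.DataFrame({
    "title": ["Software Engineer", "Product Manager", "Data Scientist"],
    "gender": ["Female", None, "Male"],
    "Race": [None, None, "Asian"],
})

with_gender = sample.dropna(subset=["gender"])        # rows where gender is present
with_race = sample.dropna(subset=["Race"])            # rows where race is present
with_both = sample.dropna(subset=["gender", "Race"])  # rows where both are present

print(len(sample), len(with_gender), len(with_race), len(with_both))
```

Filtering each column separately keeps more rows for the single-variable analyses than requiring both values at once.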

In [30]:
# Keep only the rows where gender was reported
dbf_gender = db[db['gender'].notnull()]

# Keep only the rows where race was reported
dbf_race = db[db['Race'].notnull()]

print(dbf_race)


                 timestamp   company                level  \
6921     6/1/2019 20:54:48  Facebook                  IC5   
8366     7/2/2019 16:43:16     Intel              Grade 7   
10937   9/15/2019 20:11:14   Comcast           Engineer 2   
11997  10/13/2019 11:43:20     Latch              Manager   
14429  12/30/2019 11:16:12    Intuit  Software Engineer 1   
...                    ...       ...                  ...   
61982    3/9/2021 17:03:07    Google                  L10   
61984   3/25/2021 10:45:03    Zapier                   L8   
61986   5/14/2021 13:30:43    Amazon                   L8   
61987   5/18/2021 15:34:21  Facebook                   D1   
61991   7/30/2021 22:23:24  Facebook                   E9   

                              title  totalyearlycompensation  \
6921               Product Designer                   310000   
8366              Hardware Engineer                   200000   
10937             Software Engineer                   103000   
11997  Soft