Other cool things we can do with visualization!

https://www.linkedin.com/feed/update/urn:li:activity:6968668384691433472/?utm_source=share&utm_medium=member_desktop

## Practical Exercises

In [1]:
import io
import matplotlib
import matplotlib.pyplot as plt
import numpy
import pandas as pd
import seaborn

seaborn.set_context('talk')



## Reading the dataset

In [2]:
url = '../data/sysarmy_survey_2020_processed.csv'

# Alternatively, use this URL to read the dataset directly from a server (for example, when running in Google Colab).
# url = 'https://www.famaf.unc.edu.ar/~nocampo043/sysarmy_survey_2020_processed.csv'

df = pd.read_csv(url)

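The following code cells split the comma-separated answers in the `tools_programming_languages` column into individual programming languages, so that each language can be analyzed (for example, counted) separately.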

It is not necessary to understand this code in depth, although it is a good exercise.

In [42]:
relevant_columns = ['tools_programming_languages',
                    'salary_monthly_NET',
                    'work_contract_type']

# Convert the comma-separated string of languages to a list of strings.
# Remove the 'None of the previous one' option, spaces and trailing commas.
def split_languages(languages_str):
  if not isinstance(languages_str, str):
    return []
  # Remove the 'none of the previous one' and 'none' options
  languages_str = languages_str.lower()\
    .replace('none of the previous one', '') \
    .replace('none', '')
  # Split string into list of items
  # Remove spaces and commas for each item
  return [lang.strip().replace(',', '')
          for lang in languages_str.split()]

# Create a new column with the list of languages
df.loc[:, 'cured_programming_languages'] = df.tools_programming_languages\
    .apply(split_languages)
if 'cured_programming_languages' not in relevant_columns:
    relevant_columns.append('cured_programming_languages') 

# Duplicate each row of df for each programming language
# mentioned in the response.
# We only include in df_lang the columns we are going to analyze later, so we
# don't duplicate unnecessary information.
df_lang = df.cured_programming_languages\
    .apply(pd.Series).stack()\
    .reset_index(level=-1, drop=True).to_frame()\
    .join(df[relevant_columns])\
    .rename(columns={0: 'programming_language'})
# Horrible programming style! But a lot of data science code can be written as
# concatenations of functions (pipelines), and there's no elegant way of
# doing that in Python.
df_lang = df_lang.drop(columns=["tools_programming_languages", "cured_programming_languages"]).reset_index()

In [43]:
df[:3]

Unnamed: 0,profile_gender,profile_age,work_country,work_province,profile_years_experience,work_years_in_company,work_years_in_current_position,work_people_in_charge_of,profile_studies_level,profile_studies_level_state,...,salary_inflation_adjustment_2020,salary_percentage_inflation_adjustment_2020,salary_month_last_inflation_adjustment,work_has_violence_situations,profile_has_disabilities_hiring_difficulties,company_employee_number,company_main_activity,company_recommended,company_diversity_policies,cured_programming_languages
0,Female,26,Argentina,Ciudad Autónoma de Buenos Aires,3.0,3.0,3.0,0,University,Ongoing,...,No,0.0,0,In my current job,,501-1000,Services / Software Consulting / Digital,7,2,[]
1,Male,29,Argentina,Corrientes,5.0,2.0,2.0,4,University,Ongoing,...,One,10.0,1,Never,No,201-500,Other industries,8,9,"[html, javascript, python]"
2,Female,22,Argentina,Ciudad Autónoma de Buenos Aires,2.0,0.0,0.0,0,Secondary,Complete,...,No,0.0,0,In a previous job,No,2001-5000,Other industries,6,9,[]


In the `programming_language` column, you will find each language separately. Note that if a response contained 3 languages, such as `"HTML, Javascript, Python"`, the row has been replicated 3 times. Therefore, there are three rows with index 1.

In [44]:
df_lang[:3]

Unnamed: 0,index,programming_language,salary_monthly_NET,work_contract_type
0,1,html,63000.0,Full-Time
1,1,javascript,63000.0,Full-Time
2,1,python,63000.0,Full-Time
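
As an aside, recent pandas versions provide `DataFrame.explode`, which can produce the same one-row-per-language layout more directly than the `apply(pd.Series).stack()` chain. The sketch below is an alternative, not the notebook's own method, and it runs on made-up toy data: the names `toy` and `toy_lang` and all the values are illustrative assumptions.

```python
import pandas as pd

# Toy data that mimics the shape of the survey columns (all values are made up).
toy = pd.DataFrame({
    'cured_programming_languages': [[], ['html', 'javascript', 'python'], ['java']],
    'salary_monthly_NET': [43000.0, 63000.0, 50000.0],
})

toy_lang = (
    toy
    # One row per list element; an empty list becomes a single NaN row.
    .explode('cured_programming_languages')
    # Drop the rows that came from empty lists.
    .dropna(subset=['cured_programming_languages'])
    .rename(columns={'cured_programming_languages': 'programming_language'})
    # Move the repeated original row labels into an 'index' column.
    .reset_index()
)

print(toy_lang)
```

As in `df_lang`, the response with three languages appears three times with the same `index` value, and the response with no languages does not appear at all.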


## Exercises

Using the `df` dataframe:

1) Remove the outliers in `df` using the `salary_monthly_NET` column. Consider as an outlier every point that lies more than 2.5 standard deviations away from the mean.

2) Plot a scatterplot comparing `salary_monthly_NET` and `salary_monthly_GROSS`, using two different colors to distinguish the points with a `Full-Time` contract from those with a `Part-Time` contract.

3) Plot two boxplots comparing the `salary_monthly_NET` of `Full-Time` and `Part-Time` devs.
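
One possible way to approach the outlier rule in exercise 1 is to standardize the column and keep only the rows within 2.5 standard deviations of the mean, on either side. The sketch below uses made-up salaries; `toy`, `z_score` and `toy_clean` are hypothetical names, and the symmetric cutoff is an assumption about how the rule should be read.

```python
import pandas as pd

# Made-up salaries; the last value is deliberately extreme.
toy = pd.DataFrame({
    'salary_monthly_NET': [40000.0, 45000.0, 50000.0, 52000.0, 55000.0,
                           58000.0, 60000.0, 62000.0, 65000.0, 2000000.0],
})

salary = toy['salary_monthly_NET']

# Distance of each point from the mean, measured in standard deviations.
z_score = (salary - salary.mean()) / salary.std()

# Keep only the rows that are at most 2.5 standard deviations from the mean.
toy_clean = toy[z_score.abs() <= 2.5]

print(len(toy), '->', len(toy_clean))
```

Note that the mean and the standard deviation are themselves affected by extreme values, so on real data it is worth inspecting how many rows the filter actually removes.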

Using the `df_lang` dataframe:

4) Use the `programming_language` column to filter out every language that is not in [`javascript`, `sql`, `html`, `python`, `java`]. For those 5 languages, create a `barplot` and a `displot` comparing `salary_monthly_NET`. HELP: For the `displot`, check the `col` parameter in `seaborn`.

5) Calculate the conditional probability of $A$ = "earning more than \$50000" given $L$ = "using the programming language Python", that is, $P(A|L)$.

6) (Optional) For each programming language X, calculate the number of developers who use X. HELP: Check the pandas method `value_counts()`, which counts how often each value appears in a column (a random variable). Then, obtain the 5 most used programming languages.
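
For exercise 5, recall that $P(A|L) = P(A \cap L) / P(L)$. The sketch below estimates both probabilities as relative frequencies on a made-up table shaped like `df_lang`; the names `toy_lang`, `is_python` and `earns_more` and all the values are illustrative assumptions, and it assumes each respondent lists a language at most once.

```python
import pandas as pd

# Made-up (language, salary) rows shaped like df_lang.
toy_lang = pd.DataFrame({
    'programming_language': ['python', 'python', 'python', 'python',
                             'java', 'java', 'sql', 'html'],
    'salary_monthly_NET': [40000.0, 60000.0, 75000.0, 52000.0,
                           45000.0, 90000.0, 30000.0, 55000.0],
})

is_python = toy_lang['programming_language'] == 'python'   # event L
earns_more = toy_lang['salary_monthly_NET'] > 50000        # event A

# The mean of a boolean column is the fraction of True values.
p_l = is_python.mean()
p_a_and_l = (is_python & earns_more).mean()

# Conditional probability P(A|L) = P(A and L) / P(L).
p_a_given_l = p_a_and_l / p_l

print(p_a_given_l)
```

The same number can be obtained by first restricting the table to the Python rows and then taking the fraction of those rows with a salary above 50000, that is, `earns_more[is_python].mean()`.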