# **Introduction to Python. Day 5**

## *Dr Kirils Makarovs*

## *k.makarovs@exeter.ac.uk*

## *University of Exeter Q-Step Centre*

---


# **Welcome to Day 5!**

## **By now, you should be familiar with:**

+ The overall workflow of Jupyter Notebooks in Google Colab
+ The basics of Python syntax and operations with lists
+ How to read in external datasets
+ How to navigate datasets - subsetting, accessing rows/columns
+ Operations with variables, e.g. recoding and creating new variables
+ Exploratory data analysis
+ Data visualization


## **Today, you are going to work on:**

+ The final data analysis report

---



# **1. Preparing to work in Python**

In [None]:
# Import the necessary libraries

import pandas as pd # data analysis and management library
import numpy as np # multi-dimensional arrays
import math # library with math-related commands like square root, etc.
import random # random number generator via random.sample()

# Data visualization libraries

import seaborn as sns # easy-syntax plots
import matplotlib.pyplot as plt # deep-level library used to tweak the details of the seaborn plots


In [None]:
# Mount your Google Drive

# Mounting your Google Drive will enable you to access files from Drive in Google Colab e.g. datasets, notebooks, etc.

from google.colab import drive

# This will prompt for authorization. Enter your authorisation code and rerun the cell

drive.mount('/content/drive')


## **Python Data Science Handbook by Jake VanderPlas**

<figure>
<img src="https://jakevdp.github.io/PythonDataScienceHandbook/figures/PDSH-cover.png" width="500">
</figure>

*The book is available [here](https://jakevdp.github.io/PythonDataScienceHandbook/).*

*Reproducible notebooks for the handbook are available [here](https://github.com/jakevdp/PythonDataScienceHandbook/tree/master/notebooks).*

---

# **2. Exporting Colab notebooks**

+ *Rough-and-ready PDF:* `File -> Print`

    (this doesn't always work, and you may lose text that runs beyond the page limits)

+ *Exporting as HTML:* Instructions with screenshots available [here](https://stackoverflow.com/questions/53460051/convert-ipynb-notebook-to-html-in-google-colab).

    + **Step #1:** Save a copy of the notebook in `.ipynb` format onto your machine and give it a different name. For this, use `File -> Download .ipynb`
    + **Step #2:** Upload the saved notebook into the current Colab session by going to `Files` and clicking on `Upload to session storage`
    + **Step #3:** The notebook should now be visible in the root folder of `Files`. Click on the three dots next to it and copy the path to the notebook
    + **Step #4:** In the current notebook, create a new code cell and paste the following piece of code:

      ```
      %%shell
      jupyter nbconvert --to html <notebook_directory.ipynb>
      ```

    + **Step #5:** Change `<notebook_directory.ipynb>` to the path that you got in **Step #3** and run the code cell
    + **Step #6:** The HTML file should now be visible in the root folder of `Files`. Click on the three dots next to it and download it onto your machine
    + **Step #7:** Find the HTML file on your machine and click on it. It should open in your browser
    + **Step #8 (additional):** You can save the HTML file as a PDF by pressing `ctrl/command + P` in your browser


In [None]:
%%shell
jupyter nbconvert --to html /content/HTML_Python_Day_5.ipynb

---

# **3. Data analysis report**




You are going to work with the **Data Science Job Salaries Dataset**

This is a small dataset on the data science job market, collected in 2020 and 2021 in various countries. It contains information on 245 employees.

The dataset has the following variables:

+ **work_year**: The year during which the salary was paid
+ **experience_level**: The experience level in the job
+ **employment_type**: The type of employment for the role
+ **job_title**: The role worked in during the year
+ **salary**: The total gross salary amount paid
+ **salary_currency**: The currency of the salary
+ **salary_in_usd**: The salary in US dollars
+ **employee_residence**: The employee's primary country of residence during the work year
+ **remote_ratio**: The overall amount of work done remotely
+ **company_location**: The country of the employer's main office or contracting branch
+ **company_size**: The average number of people that worked for the company during the year

In [None]:
# Let's get the dataset first!

# Note that your pathway to the file might be different from mine
df = pd.read_csv('/content/drive/MyDrive/Colab Notebooks/Intro_to_python/Day_5/Data_Science_Jobs_Salaries.csv')

# Data source: https://www.kaggle.com/saurabhshahane/data-science-jobs-salaries


In [None]:
print(df.shape) # 245 observations and 11 variables

print(df.columns) # names of the variables

df.info() # salary and salary_in_usd are numeric; all other variables should be treated as categorical
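
As a rough illustration of what "treated as categorical" can look like in code, the sketch below separates numeric from text columns and converts the text ones to the `category` dtype. It uses a small made-up DataFrame called `toy` (the values are invented; only the column names mirror the real dataset), so treat it as one possible approach rather than a required step.

```python
import pandas as pd

# Toy data with made-up values; only the column names mirror the real dataset
toy = pd.DataFrame({
    'experience_level': ['MI', 'SE', 'EN', 'SE'],
    'company_size': ['L', 'S', 'M', 'L'],
    'salary_in_usd': [60000, 120000, 45000, 110000],
})

# Numeric columns (candidates for means, histograms, scatterplots)
numeric_cols = toy.select_dtypes(include='number').columns.tolist()

# Everything that is not numeric is a candidate for categorical treatment
text_cols = [col for col in toy.columns if not pd.api.types.is_numeric_dtype(toy[col])]

# Optional: store the text columns with the 'category' dtype
for col in text_cols:
    toy[col] = toy[col].astype('category')

print(numeric_cols)
print(text_cols)
print(toy.dtypes)
```

The conversion is optional for this report: frequency tables, crosstabs, and seaborn count plots also work on plain text columns.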


In [None]:
# (just run this cell before starting your data analysis)

# Let me clean the dataset a bit for you:

##########

# work_year variable now has two clean values: 2020 and 2021
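# (the result is assigned back to the column, because inplace=True on a selected column is not reliably applied in newer pandas versions)
df['work_year'] = df['work_year'].replace(['2021e'], ['2021'])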

##########

# experience_level variable now has interpretable levels
df['experience_level'] = df['experience_level'].replace(['MI', 'SE', 'EN', 'EX'],
                        ['Intermediate', 'Expert', 'Junior', 'Director'])

##########

# employment_type variable now has interpretable levels
df['employment_type'] = df['employment_type'].replace(['FT', 'PT', 'CT', 'FL'],
                        ['Full-time', 'Part-time', 'Contract', 'Freelance'])

# Additionally, because of low frequencies, I combine Part-time, Contract, and Freelance into one group
df['employment_type'] = np.where(df['employment_type'].isin(['Part-time', 'Contract', 'Freelance']), 'Part-Time/Contract/Freelance', df['employment_type'])

##########

# There were quite a lot of job titles in the job_title variable that had a low frequency (less than 5), so I combine them all into
# one category called 'Other'

# Creating a list of job titles whose frequency is less than 5
jobs_low_fr = df['job_title'].value_counts()[df['job_title'].value_counts() < 5].index

# Overriding the job_title variable so that everyone with an infrequent job title goes into the 'Other' category,
# and the rest keep their original job titles
df['job_title'] = np.where(df['job_title'].isin(jobs_low_fr), 'Other', df['job_title'])

##########

# There were quite a lot of countries in the employee_residence variable that had a low frequency (less than 7), so I combine them all into
# one category called 'Other'

# Creating a list of countries whose frequency is less than 7
countries_res_low_fr = df['employee_residence'].value_counts()[df['employee_residence'].value_counts() < 7].index

# Overriding the employee_residence variable so that everyone from an infrequent country goes into the 'Other' category,
# and the rest keep their original countries of residence
df['employee_residence'] = np.where(df['employee_residence'].isin(countries_res_low_fr), 'Other', df['employee_residence'])

##########

# remote_ratio variable now has interpretable levels
df['remote_ratio'] = df['remote_ratio'].replace([0, 50, 100],
                        ['No remote work', 'Partially remote', 'Fully remote'])

##########

# There were quite a lot of countries in the company_location variable that had a low frequency (less than 7), so I combine them all into
# one category called 'Other'

# Creating a list of countries whose frequency is less than 7
countries_loc_low_fr = df['company_location'].value_counts()[df['company_location'].value_counts() < 7].index

# Overriding the company_location variable so that every company in an infrequent country goes into the 'Other' category,
# and the rest keep their original company locations
df['company_location'] = np.where(df['company_location'].isin(countries_loc_low_fr), 'Other', df['company_location'])

##########

# company_size variable now has interpretable levels
df['company_size'] = df['company_size'].replace(['L', 'S', 'M'],
                        ['Large', 'Small', 'Medium'])
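
The cleaning cell above applies the same "combine rare categories into 'Other'" pattern to three variables. A minimal sketch of that pattern as a reusable function is shown below. The helper name `collapse_rare` and the toy job titles are made up for illustration, and the sketch uses `Series.where()` in place of the `np.where()` call from the cell above.

```python
import pandas as pd

def collapse_rare(series, min_count, other_label='Other'):
    """Hypothetical helper: relabel categories seen fewer than min_count times."""
    counts = series.value_counts()
    rare = counts[counts < min_count].index
    # Keep the original value where the category is NOT rare, otherwise use other_label
    return series.where(~series.isin(rare), other_label)

# Made-up job titles, for illustration only
titles = pd.Series(
    ['Data Scientist'] * 5 + ['Data Engineer'] * 3 + ['ML Engineer', 'Data Analyst'],
    name='job_title',
)

collapsed = collapse_rare(titles, min_count=3)
print(collapsed.value_counts())
```

With a helper like this, each of the three recodes could be written as a single line, for example `df['job_title'] = collapse_rare(df['job_title'], 5)`.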


In [None]:
# Get the head of the cleaned dataset

df.head(10)


---

# **4. Your task**

*For the final data analysis report, please:*

1. form groups of 2-3 people
2. come up with a small research question that can be answered with the help of the provided dataset (we will discuss some of the examples in class)
3. produce 2-4 tables **and** 2-4 graphs that address your research question 
4. interpret the tables and graphs that you have obtained (you don't have to write any text)
5. send your graphs and tables to k.makarovs@exeter.ac.uk as a **single** Notebook, a **single** PDF file, or a **single** HTML file **(a single HTML file is preferable)**
6. present your findings to the class and let's discuss them!






*Here is a cheat sheet with handy Python commands that you can use for the final data analysis report:*

| Command | Description |
| ------ | ----------- |
| `df['variable'].value_counts()`   | Frequency table |
| `pd.crosstab(df['variable_1'], df['variable_2'])` | Crosstab |
| `df.groupby('variable_1')['variable_2'].mean()` | Aggregated analysis|
| `sns.histplot()`, `sns.countplot()`, `sns.scatterplot()`, etc. | Seaborn graphs|
| `np.where(condition, value if True, value if False)` | Creating new variable|

Don't forget that you can also subset the dataset via the `.loc[]` indexer if you want to focus on particular rows/columns of the dataset!
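
To show how the non-plotting commands from the cheat sheet fit together, here is a small worked sketch. The DataFrame `toy`, its values, and the 100,000 USD threshold are all made up for illustration; with the real data you would use `df` and choose variables and cut-offs that suit your research question.

```python
import numpy as np
import pandas as pd

# Made-up rows for illustration; only the column names follow the dataset
toy = pd.DataFrame({
    'experience_level': ['Junior', 'Expert', 'Expert', 'Intermediate', 'Junior', 'Expert'],
    'company_size': ['Small', 'Large', 'Medium', 'Large', 'Small', 'Large'],
    'salary_in_usd': [40000, 150000, 120000, 80000, 50000, 130000],
})

# Frequency table: how many employees per experience level
freq = toy['experience_level'].value_counts()

# Crosstab: experience level by company size
xtab = pd.crosstab(toy['experience_level'], toy['company_size'])

# Aggregated analysis: mean salary per experience level
mean_salary = toy.groupby('experience_level')['salary_in_usd'].mean()

# Creating a new variable: flag salaries of at least 100,000 USD (arbitrary threshold)
toy['high_salary'] = np.where(toy['salary_in_usd'] >= 100000, 'Yes', 'No')

# Subsetting with .loc[]: rows for experts only, two columns
experts = toy.loc[toy['experience_level'] == 'Expert', ['company_size', 'salary_in_usd']]

print(freq)
print(xtab)
print(mean_salary)
print(experts)
```

Each of these tables can also be passed to a seaborn function, for instance the raw column to `sns.countplot()` or the numeric column to `sns.histplot()`, to produce the matching graph.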



# **That's the end of Day 5!**