# pandas for Data Science

![Data Science Workflow](img/ds-workflow.png)

## pandas
- When working with tabular data (spreadsheets, databases, etc.), **pandas** is the right tool
- **pandas** makes it easy to acquire, explore, clean, process, analyze, and visualize your data
- Together, these steps cover the full Data Science process

## pandas help
- **pandas** is a large and complex tool
- **pandas** can do (almost) everything with data
    - If you can do it in Excel, you can do it in **pandas**
- **pandas** has a great [Cheat Sheet](https://pandas.pydata.org/Pandas_Cheat_Sheet.pdf) to help you
- **pandas** also has great [tutorials](https://pandas.pydata.org/docs/getting_started/index.html)

## What will we cover here?
- Some insights into **DataFrames** (the main data structure in **pandas**)
- How to work with data

## This course also covers
- Later we will dive into how **pandas** can get data from various sources
    - Web Scraping, Databases, CSV, Parquet, Excel files
- How to combine data from different sources
- How to deal with missing data

## Getting started with pandas
- **pandas** is installed by default in Anaconda (Jupyter Notebooks)
- In other environments you can install it with
    - ```pip install pandas```
- To access **pandas** you need to import it
    - ```import pandas as pd```

In [1]:
import pandas as pd

### What is pandas?
- **pandas** is like an Excel sheet, just better
- To learn **pandas**, let's play with some data

### Read data from CSV
- What is CSV? See this lecture ([Lecture on CSV](https://youtu.be/LEyojSOg4EI))
- ```pd.read_csv(filename, parse_dates, index_col)``` ([docs](https://pandas.pydata.org/docs/reference/api/pandas.read_csv.html))
    - ```filename```: The path to the file
    - ```parse_dates=True```: If True, try to parse the index as dates (default False)
    - ```index_col=0```: Set the index to be column 0
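
The parameters above can be tried without any file on disk. The following is a minimal sketch that reads a small in-memory CSV; the three sample rows and the `date` column are hypothetical and only serve to show the effect of `index_col` and `parse_dates`.

```python
import io

import pandas as pd

# Hypothetical sample data (not the salaries file used in this notebook).
csv_text = """date,job_title,salary_in_usd
2020-05-01,Data Scientist,68428
2021-01-15,Data Engineer,150000
2021-05-01,Head of Data,230000
"""

# index_col=0 turns the first column (date) into the index.
sample = pd.read_csv(io.StringIO(csv_text), index_col=0)

# parse_dates=True additionally tries to parse that index as dates.
sample_dates = pd.read_csv(io.StringIO(csv_text), index_col=0, parse_dates=True)

print(sample.shape)
print(type(sample.index).__name__)
print(type(sample_dates.index).__name__)
```

Without `parse_dates=True` the index keeps the dates as plain text, so date-based slicing would not behave as expected.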

In [5]:
data = pd.read_csv(r"C:\Users\Haseeb Aqeel\Desktop\Data science books\starter\starter\files\data_science_salaries.csv", index_col=0)

In [6]:
data.head()

| work_year | experience_level | employment_type | job_title | salary | salary_currency | salary_in_usd | employee_residence | remote_ratio | company_location | company_size |
|---|---|---|---|---|---|---|---|---|---|---|
| 2021e | EN | FT | Data Science Consultant | 54000 | EUR | 64369 | DE | 50 | DE | L |
| 2020 | SE | FT | Data Scientist | 60000 | EUR | 68428 | GR | 100 | US | L |
| 2021e | EX | FT | Head of Data Science | 85000 | USD | 85000 | RU | 0 | RU | M |
| 2021e | EX | FT | Head of Data | 230000 | USD | 230000 | RU | 50 | RU | L |
| 2021e | EN | FT | Machine Learning Engineer | 125000 | USD | 125000 | US | 100 | US | S |


### Always check data
- ```.head()```: Returns the first 5 rows
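
As a small sketch of how this check can be adjusted, `.head()` accepts the number of rows to return, and `.tail()` does the same from the end. The DataFrame below is hypothetical and only used for illustration.

```python
import pandas as pd

# Hypothetical data, used only to illustrate head() and tail().
df = pd.DataFrame({"salary": [10, 20, 30, 40, 50, 60, 70]})

first_default = df.head()   # first 5 rows
first_two = df.head(2)      # first 2 rows
last_two = df.tail(2)       # last 2 rows

print(first_two)
print(last_two)
```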

In [7]:
data.head()

| work_year | experience_level | employment_type | job_title | salary | salary_currency | salary_in_usd | employee_residence | remote_ratio | company_location | company_size |
|---|---|---|---|---|---|---|---|---|---|---|
| 2021e | EN | FT | Data Science Consultant | 54000 | EUR | 64369 | DE | 50 | DE | L |
| 2020 | SE | FT | Data Scientist | 60000 | EUR | 68428 | GR | 100 | US | L |
| 2021e | EX | FT | Head of Data Science | 85000 | USD | 85000 | RU | 0 | RU | M |
| 2021e | EX | FT | Head of Data | 230000 | USD | 230000 | RU | 50 | RU | L |
| 2021e | EN | FT | Machine Learning Engineer | 125000 | USD | 125000 | US | 100 | US | S |


## Index and columns
- ```.index```: Returns the index
- ```.columns```: Returns the column names as an ```Index``` object (use ```.tolist()``` to get a plain list)

In [8]:
data.index

Index(['2021e', '2020', '2021e', '2021e', '2021e', '2021e', '2020', '2020',
       '2020', '2021e',
       ...
       '2021e', '2021e', '2021e', '2021e', '2021e', '2020', '2021e', '2020',
       '2020', '2021e'],
      dtype='object', name='work_year', length=245)

In [9]:
data.shape

(245, 10)

In [10]:
data.columns

Index(['experience_level', 'employment_type', 'job_title', 'salary',
       'salary_currency', 'salary_in_usd', 'employee_residence',
       'remote_ratio', 'company_location', 'company_size'],
      dtype='object')
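
Both `.index` and `.columns` return `Index` objects, as the outputs above show. When a plain Python list is more convenient, `.tolist()` converts them. A minimal sketch on a hypothetical two-row DataFrame:

```python
import pandas as pd

# Hypothetical two-row frame that mimics the index name used in this notebook.
df = pd.DataFrame(
    {"salary": [54000, 60000], "salary_currency": ["EUR", "EUR"]},
    index=pd.Index(["2021e", "2020"], name="work_year"),
)

column_names = df.columns.tolist()   # plain list of column names
index_labels = df.index.tolist()     # plain list of row labels

print(column_names)
print(index_labels)
```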

## Each column has a data type
- ```.dtypes```: Returns the data types of each column

In [11]:
data.dtypes

experience_level      object
employment_type       object
job_title             object
salary                 int64
salary_currency       object
salary_in_usd          int64
employee_residence    object
remote_ratio           int64
company_location      object
company_size          object
dtype: object
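
Data types are also useful for filtering and converting columns. The sketch below, on hypothetical data, keeps only the numeric columns with `select_dtypes` and changes one column's type with `astype`.

```python
import pandas as pd

# Hypothetical data with one text column and two integer columns.
df = pd.DataFrame(
    {
        "job_title": ["Data Scientist", "Data Engineer"],
        "salary": [60000, 150000],
        "remote_ratio": [100, 50],
    }
)

# Keep only the numeric columns.
numeric = df.select_dtypes(include="number")

# Convert an integer column to floating point.
df["remote_ratio"] = df["remote_ratio"].astype("float64")

print(numeric.columns.tolist())
print(df.dtypes)
```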

## The size and shape of data
- ```len(data)```: gives the number of rows in the DataFrame
- ```.shape```: Returns the number of rows and columns in the DataFrame

In [12]:
len(data)

245

In [13]:
data.shape

(245, 10)
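
Since `.shape` is a tuple, it can be unpacked directly into row and column counts. A short sketch on a hypothetical DataFrame:

```python
import pandas as pd

# Hypothetical frame with 3 rows and 2 columns.
df = pd.DataFrame({"salary": [1, 2, 3], "remote_ratio": [0, 50, 100]})

n_rows, n_cols = df.shape   # unpack (rows, columns)

print(n_rows, n_cols)
print(len(df) == n_rows)
```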

## Slicing rows and columns
- ```data['Close']```: Select one column (Series)
- ```data[['Open', 'Close']]```: Select multiple columns with specific names
- ```data.loc['2020-05-01':'2021-05-01']```: Select all rows between the dates (including 2021-05-01); this assumes a date index
- ```data.iloc[50:55]```: Select the rows at positions 50 to 54 (excluding 55)
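
The four selections above can be sketched on a small date-indexed DataFrame. The `Open` and `Close` values and the dates below are hypothetical; they are not part of the salaries dataset.

```python
import pandas as pd

# Hypothetical price data with a daily date index.
dates = pd.date_range("2020-04-29", periods=6, freq="D")
prices = pd.DataFrame(
    {"Open": [10, 11, 12, 13, 14, 15], "Close": [11, 12, 11, 15, 14, 16]},
    index=dates,
)

one_column = prices["Close"]                        # a Series
two_columns = prices[["Open", "Close"]]             # a DataFrame
by_label = prices.loc["2020-05-01":"2020-05-03"]    # rows by label, end included
by_position = prices.iloc[1:3]                      # rows 1 and 2, end excluded

print(by_label)
print(by_position)
```

Note the difference between the two row selections: `.loc` includes the end label, while `.iloc` follows the usual Python convention and excludes the end position.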

In [14]:
data.iloc[50:55]

| work_year | experience_level | employment_type | job_title | salary | salary_currency | salary_in_usd | employee_residence | remote_ratio | company_location | company_size |
|---|---|---|---|---|---|---|---|---|---|---|
| 2021e | SE | FT | Machine Learning Engineer | 80000 | EUR | 95362 | DE | 50 | DE | L |
| 2021e | EN | FT | Data Engineer | 2250000 | INR | 30509 | IN | 100 | IN | L |
| 2021e | SE | FT | Data Engineer | 150000 | USD | 150000 | US | 100 | US | M |
| 2021e | SE | FT | Data Engineer | 115000 | USD | 115000 | US | 100 | US | S |
| 2021e | MI | FT | Research Scientist | 235000 | CAD | 187917 | CA | 100 | CA | L |


## Arithmetic operations
- Calculating with columns on all rows
    - Example: ```data['Close'] - data['Open']```
- Creating new columns
    - Example: ```data['New'] = data['Open'] - data['Close']```
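
Column arithmetic is applied element-wise, producing one result per row. A minimal sketch using hypothetical `Open` and `Close` values:

```python
import pandas as pd

# Hypothetical values, used only to show element-wise arithmetic.
df = pd.DataFrame({"Open": [10.0, 12.0, 15.0], "Close": [11.0, 11.5, 15.0]})

# Subtracting two columns gives a new Series with one value per row.
difference = df["Close"] - df["Open"]

# Assigning to a new column name adds that column to the DataFrame.
df["New"] = df["Open"] - df["Close"]

print(difference)
print(df)
```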

In [19]:
data['sub'] = data['salary'] - data['salary_in_usd']

In [20]:
data.head()

| work_year | experience_level | employment_type | job_title | salary | salary_currency | salary_in_usd | employee_residence | remote_ratio | company_location | company_size | sub |
|---|---|---|---|---|---|---|---|---|---|---|---|
| 2021e | EN | FT | Data Science Consultant | 54000 | EUR | 64369 | DE | 50 | DE | L | -10369 |
| 2020 | SE | FT | Data Scientist | 60000 | EUR | 68428 | GR | 100 | US | L | -8428 |
| 2021e | EX | FT | Head of Data Science | 85000 | USD | 85000 | RU | 0 | RU | M | 0 |
| 2021e | EX | FT | Head of Data | 230000 | USD | 230000 | RU | 50 | RU | L | 0 |
| 2021e | EN | FT | Machine Learning Engineer | 125000 | USD | 125000 | US | 100 | US | S | 0 |


## Select data
- Select data based on boolean expressions
    - Example: ```data['New'] > 0```
    - Example: ```data[data['New'] > 0]```
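
The two examples above work together: the comparison produces a boolean Series (a mask), and indexing with that mask keeps only the rows where it is True. The sketch below uses hypothetical salary values and also shows one way to combine two conditions with `&`.

```python
import pandas as pd

# Hypothetical rows, used only to illustrate boolean selection.
df = pd.DataFrame(
    {
        "job_title": ["Data Science Consultant", "Head of Data", "Machine Learning Engineer"],
        "salary": [54000, 230000, 125000],
    }
)

mask = df["salary"] > 100000   # boolean Series, one value per row
high = df[mask]                # only the rows where the mask is True

# Each condition needs its own parentheses when combined with & or |.
mid_range = df[(df["salary"] > 100000) & (df["salary"] < 200000)]

print(mask.tolist())
print(high)
print(mid_range)
```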

In [25]:
data['salary']>100000

work_year
2021e    False
2020     False
2021e    False
2021e     True
2021e     True
         ...  
2020      True
2021e     True
2020      True
2020     False
2021e     True
Name: salary, Length: 245, dtype: bool

In [26]:
data[data['salary']>100000]

| work_year | experience_level | employment_type | job_title | salary | salary_currency | salary_in_usd | employee_residence | remote_ratio | company_location | company_size | sub |
|---|---|---|---|---|---|---|---|---|---|---|---|
| 2021e | EX | FT | Head of Data | 230000 | USD | 230000 | RU | 50 | RU | L | 0 |
| 2021e | EN | FT | Machine Learning Engineer | 125000 | USD | 125000 | US | 100 | US | S | 0 |
| 2021e | SE | FT | Data Analytics Manager | 120000 | USD | 120000 | US | 100 | US | M | 0 |
| 2020 | MI | FT | Research Scientist | 450000 | USD | 450000 | US | 0 | US | M | 0 |
| 2021e | SE | FT | Data Science Engineer | 159500 | CAD | 127543 | CA | 50 | CA | L | 31957 |
| ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
| 2021e | SE | FT | Data Specialist | 165000 | USD | 165000 | US | 100 | US | L | 0 |
| 2020 | SE | FT | Data Scientist | 412000 | USD | 412000 | US | 100 | US | L | 0 |
| 2021e | MI | FT | Principal Data Scientist | 151000 | USD | 151000 | US | 100 | US | L | 0 |
| 2020 | EN | FT | Data Scientist | 105000 | USD | 105000 | US | 100 | US | S | 0 |


## Groupby and value_counts
- Example
```Python
data['Category'] = data['New'] > 0
data.groupby('Category').mean(numeric_only=True)
```
- Example
```Python
data['Category'].value_counts()
(data['New'] > 0).value_counts()
```
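
As a self-contained sketch of both operations, the example below groups a hypothetical five-row DataFrame by experience level and counts how often each level occurs. Selecting the numeric column before calling `.mean()` avoids averaging text columns.

```python
import pandas as pd

# Hypothetical rows, used only to illustrate groupby and value_counts.
df = pd.DataFrame(
    {
        "experience_level": ["EN", "SE", "EN", "SE", "SE"],
        "job_title": ["Analyst", "Scientist", "Engineer", "Engineer", "Scientist"],
        "salary_in_usd": [60000, 120000, 80000, 140000, 100000],
    }
)

# Mean salary per experience level.
mean_by_level = df.groupby("experience_level")["salary_in_usd"].mean()

# Number of rows per experience level, most frequent first.
counts = df["experience_level"].value_counts()

print(mean_by_level)
print(counts)
```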

In [28]:
data.groupby('experience_level').mean(numeric_only=True)

| experience_level | salary | salary_in_usd | remote_ratio | sub |
|---|---|---|---|---|
| EN | 365722.518519 | 59753.462963 | 70.37037 | 305969.055556 |
| EX | 220000.0 | 226288.0 | 72.727273 | -6288.0 |
| MI | 706940.883495 | 85738.135922 | 65.048544 | 621202.747573 |
| SE | 365439.181818 | 128841.298701 | 73.376623 | 236597.883117 |


In [29]:
data.groupby('salary_currency').mean(numeric_only=True)

| salary_currency | salary | salary_in_usd | remote_ratio | sub |
|---|---|---|---|---|
| BRL | 85800.0 | 16026.0 | 0.0 | 69774.0 |
| CAD | 121300.0 | 96996.9 | 80.0 | 24303.1 |
| CLP | 30400000.0 | 40798.0 | 100.0 | 30359200.0 |
| CNY | 299000.0 | 43331.0 | 0.0 | 255669.0 |
| DKK | 240000.0 | 37373.0 | 50.0 | 202627.0 |
| EUR | 57258.6 | 67260.77193 | 62.280702 | -10002.18 |
| GBP | 63796.62 | 86447.076923 | 61.538462 | -22650.46 |
| HUF | 11000000.0 | 36233.5 | 50.0 | 10963770.0 |
| INR | 1944524.0 | 26353.095238 | 57.142857 | 1918171.0 |
| JPY | 4450000.0 | 41689.0 | 100.0 | 4408311.0 |


In [30]:
data.groupby('employee_residence').mean(numeric_only=True)

| employee_residence | salary | salary_in_usd | remote_ratio | sub |
|---|---|---|---|---|
| AE | 115000.0 | 115000.0 | 0.0 | 0.0 |
| AT | 72500.0 | 82683.5 | 25.0 | -10183.5 |
| BE | 75000.0 | 89402.0 | 100.0 | -14402.0 |
| BG | 80000.0 | 80000.0 | 100.0 | 0.0 |
| BR | 85900.0 | 51013.0 | 50.0 | 34887.0 |
| CA | 127222.2 | 101732.555556 | 83.333333 | 25489.67 |
| CL | 30400000.0 | 40798.0 | 100.0 | 30359200.0 |
| CN | 299000.0 | 43331.0 | 0.0 | 255669.0 |
| CO | 21844.0 | 21844.0 | 50.0 | 0.0 |
| DE | 71501.05 | 84501.0 | 57.894737 | -12999.95 |
