<a target="_blank" href="https://colab.research.google.com/github/lukebarousse/Python_Data_Analytics_Course/blob/main/2_Advanced/01_Pandas_Accessing_Data.ipynb">
  <img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/>
</a>

# Pandas Accessing Data

## Review Data

Let's load the data using `load_dataset` and quickly review it using `head()`.

In [1]:
# Importing Libraries
import pandas as pd
from datasets import load_dataset
import matplotlib.pyplot as plt  

# Loading Data
dataset = load_dataset('lukebarousse/data_jobs')
df = dataset['train'].to_pandas()

# Data Cleanup
df['job_posted_date'] = pd.to_datetime(df['job_posted_date'])

In [2]:
df.head()

Unnamed: 0,job_title_short,job_title,job_location,job_via,job_schedule_type,job_work_from_home,search_location,job_posted_date,job_no_degree_mention,job_health_insurance,job_country,salary_rate,salary_year_avg,salary_hour_avg,company_name,job_skills,job_type_skills
0,Data Engineer,DevOps Engineer - Big Data/Advanced Analytics,"Bari, Metropolitan City of Bari, Italy",via LinkedIn,Full-time,False,Italy,2023-10-24 18:16:15,True,False,Italy,,,,NTT DATA Italia,"['shell', 'python', 'bash', 'mongodb', 'mongod...","{'cloud': ['gcp', 'aws', 'azure'], 'databases'..."
1,Data Engineer,"Co-Op/Intern Software Engineer, Data Ingestion","Vancouver, BC, Canada",via LinkedIn,Full-time,False,Canada,2023-05-05 18:37:58,False,False,Canada,,,,Kinaxis,"['java', 'c#', 'mysql', 'oracle', 'docker', 'k...","{'cloud': ['oracle'], 'databases': ['mysql'], ..."
2,Data Engineer,Hybrid - Data Engineer,"New York, NY",via LinkedIn,Full-time,False,Sudan,2023-07-25 19:07:08,False,False,Sudan,year,425000.0,,Durlston Partners,['python'],{'programming': ['python']}
3,Data Scientist,Data Scientist,"Hamtramck, MI",via BeBee,Full-time,False,"Illinois, United States",2023-11-30 18:05:20,False,False,United States,,,,Apexon,"['sql', 'r', 'scala', 'java', 'gcp', 'aws', 'h...","{'analyst_tools': ['tableau', 'sap', 'word'], ..."
4,Data Engineer,Data Lead Engineer (with strong Python) - Remo...,Anywhere,via Jobgether,Full-time,True,Panama,2023-08-13 18:32:30,False,False,Panama,,,,FullStack Labs,"['python', 'aws', 'tableau', 'looker', 'terraf...","{'analyst_tools': ['tableau', 'looker'], 'asyn..."


We learned how to get rows using `iloc[]` before. But we can do a lot more with it. We can actually get rows *and* columns.

## iloc

### Notes

* `df.iloc[]`: Select rows and columns by position.

### Examples
Using `iloc` let's:

1. Get the first row (index 0)
2. Get the `job_skills` element (column index 15) of the first row (index 0).
3. Get the `job_skills` (index 15) and `job_type_skills` (index 16) for the third (index 2) and fourth (index 3) rows.
4. Get the first 9 rows of the DataFrame.
5. Get the first five columns of the DataFrame and all of the rows.

For this we'll need to know the index numbers for our DataFrame.

![image](images/iloc_visual_1.png)
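
The column positions in the visual can also be looked up in code. Here is a minimal sketch on a small stand-in DataFrame (`toy` and its values are hypothetical, not the course dataset). `Index.get_loc` is the pandas call that maps a column label to its integer position:

```python
import pandas as pd

# Hypothetical stand-in that reuses a few of the course dataset's column names
toy = pd.DataFrame({
    "job_title_short": ["Data Engineer", "Data Scientist"],
    "company_name": ["Acme", "Globex"],
    "job_skills": ["['python']", "['sql', 'r']"],
})

# Map every column label to its integer position
positions = {name: pos for pos, name in enumerate(toy.columns)}
print(positions)

# Look up the position of a single column
skills_pos = toy.columns.get_loc("job_skills")
print(skills_pos)

# Use that position with iloc
print(toy.iloc[0, skills_pos])
```

On the full `df`, the same call with `'job_skills'` should return 15, which is the position used in the examples here.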

1. Get the first row.

In [3]:
df.iloc[0]

job_title_short                                              Data Engineer
job_title                    DevOps Engineer - Big Data/Advanced Analytics
job_location                        Bari, Metropolitan City of Bari, Italy
job_via                                                       via LinkedIn
job_schedule_type                                                Full-time
job_work_from_home                                                   False
search_location                                                      Italy
job_posted_date                                        2023-10-24 18:16:15
job_no_degree_mention                                                 True
job_health_insurance                                                 False
job_country                                                          Italy
salary_rate                                                            NaN
salary_year_avg                                                        NaN
salary_hour_avg          

2. Get the `job_skills` element (column index 15) of the first row (index 0).

In [4]:
df.iloc[0][15]

"['shell', 'python', 'bash', 'mongodb', 'mongodb', 'cassandra', 'elasticsearch', 'gcp', 'aws', 'azure', 'kafka', 'hadoop', 'docker', 'kubernetes', 'terraform', 'ansible', 'git', 'jenkins']"

##### Note: Use `df.iloc[0, 15]` instead of `df.iloc[0][15]` to ensure future compatibility with pandas.

`df.iloc[0]` returns the first row as a Series indexed by column name, so `df.iloc[0][15]` then asks that Series for the key `15`. pandas currently falls back to treating that integer as a position, but this fallback is deprecated: recent pandas versions emit a `FutureWarning`, and in a future version integer keys will be treated as labels. Writing `df.iloc[0, 15]` specifies the row and column positions in a single call, which is unambiguous and skips the intermediate Series.
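
As a small illustration of the two forms, here is a sketch on a hypothetical three-column frame (`toy` and its values are stand-ins, not the course data). Both return the same value. When two steps are genuinely needed, calling `.iloc` on the intermediate Series keeps the lookup positional:

```python
import pandas as pd

# Hypothetical stand-in for the course DataFrame
toy = pd.DataFrame({
    "job_title_short": ["Data Engineer", "Data Scientist"],
    "company_name": ["Acme", "Globex"],
    "job_skills": ["['python']", "['sql', 'r']"],
})

# One call: row position 0, column position 2
direct = toy.iloc[0, 2]

# Two steps: take the first row as a Series, then position 2 within it
first_row = toy.iloc[0]
two_step = first_row.iloc[2]

print(direct)
print(direct == two_step)
```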

So we should instead write:

In [5]:
df.iloc[0,15]

"['shell', 'python', 'bash', 'mongodb', 'mongodb', 'cassandra', 'elasticsearch', 'gcp', 'aws', 'azure', 'kafka', 'hadoop', 'docker', 'kubernetes', 'terraform', 'ansible', 'git', 'jenkins']"

3. Get the `job_skills` (index 15) and `job_type_skills` (index 16) for the third (index 2) and fourth (index 3) rows.
    * To get third (index 2) and fourth (index 3) rows: `[2,3]`
    * To get `job_skills` and `job_type_skills`, which are index 15 and index 16 respectively: `[15,16]`
    * Then pass both lists to `iloc`, rows first and columns second, to get the values where those rows and columns intersect: `df.iloc[[2,3],[15,16]]`

![image](images/iloc_visual_2.png) 

In [6]:
df.iloc[[2,3],[15,16]]

Unnamed: 0,job_skills,job_type_skills
2,['python'],{'programming': ['python']}
3,"['sql', 'r', 'scala', 'java', 'gcp', 'aws', 'h...","{'analyst_tools': ['tableau', 'sap', 'word'], ..."
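
Because the rows and columns picked here are adjacent, slices would return the same selection as the two lists. A small sketch on a hypothetical frame (`toy` stands in for `df`) compares the two, then shows a case that only lists can express:

```python
import pandas as pd

# Hypothetical 5x3 frame standing in for df
toy = pd.DataFrame({
    "a": [0, 1, 2, 3, 4],
    "b": [10, 11, 12, 13, 14],
    "c": [20, 21, 22, 23, 24],
})

# Explicit lists of row and column positions
by_list = toy.iloc[[2, 3], [1, 2]]

# Equivalent slices: 2:4 covers rows 2 and 3, 1:3 covers columns 1 and 2
by_slice = toy.iloc[2:4, 1:3]

print(by_list)
print(by_list.equals(by_slice))

# Lists can also pick non-adjacent positions, which a single slice cannot
corners = toy.iloc[[0, 4], [0, 2]]
print(corners)
```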


##### Preview

Below are a few more examples of what `iloc` can do. Pay close attention to these because we'll use them later. 

4. Get the first 9 rows of the DataFrame. The end of an `iloc` slice is excluded, so `:9` returns positions 0 through 8.

In [7]:
df.iloc[:9]

Unnamed: 0,job_title_short,job_title,job_location,job_via,job_schedule_type,job_work_from_home,search_location,job_posted_date,job_no_degree_mention,job_health_insurance,job_country,salary_rate,salary_year_avg,salary_hour_avg,company_name,job_skills,job_type_skills
0,Data Engineer,DevOps Engineer - Big Data/Advanced Analytics,"Bari, Metropolitan City of Bari, Italy",via LinkedIn,Full-time,False,Italy,2023-10-24 18:16:15,True,False,Italy,,,,NTT DATA Italia,"['shell', 'python', 'bash', 'mongodb', 'mongod...","{'cloud': ['gcp', 'aws', 'azure'], 'databases'..."
1,Data Engineer,"Co-Op/Intern Software Engineer, Data Ingestion","Vancouver, BC, Canada",via LinkedIn,Full-time,False,Canada,2023-05-05 18:37:58,False,False,Canada,,,,Kinaxis,"['java', 'c#', 'mysql', 'oracle', 'docker', 'k...","{'cloud': ['oracle'], 'databases': ['mysql'], ..."
2,Data Engineer,Hybrid - Data Engineer,"New York, NY",via LinkedIn,Full-time,False,Sudan,2023-07-25 19:07:08,False,False,Sudan,year,425000.0,,Durlston Partners,['python'],{'programming': ['python']}
3,Data Scientist,Data Scientist,"Hamtramck, MI",via BeBee,Full-time,False,"Illinois, United States",2023-11-30 18:05:20,False,False,United States,,,,Apexon,"['sql', 'r', 'scala', 'java', 'gcp', 'aws', 'h...","{'analyst_tools': ['tableau', 'sap', 'word'], ..."
4,Data Engineer,Data Lead Engineer (with strong Python) - Remo...,Anywhere,via Jobgether,Full-time,True,Panama,2023-08-13 18:32:30,False,False,Panama,,,,FullStack Labs,"['python', 'aws', 'tableau', 'looker', 'terraf...","{'analyst_tools': ['tableau', 'looker'], 'asyn..."
5,Senior Data Engineer,Senior Data Engineer,"Milford, CT",via Snagajob,Full-time,False,"California, United States",2023-07-28 18:06:44,False,False,United States,,,,"Franchise World Headquarters, LLC","['python', 'nosql', 'sql', 'aws', 'redshift', ...","{'cloud': ['aws', 'redshift'], 'other': ['flow..."
6,Data Engineer,Data Engineer,Texas,via LinkedIn,Full-time,False,"Texas, United States",2023-03-06 18:51:56,False,True,United States,year,180000.0,,"Trepp, Inc.","['sql', 'python', 'java', 'scala', 'aws', 'spa...","{'cloud': ['aws'], 'libraries': ['spark'], 'ot..."
7,Data Analyst,"Applied Mathematician, Scientist, or Engineer","Copenhagen, Denmark",via BeBee Danmark,Full-time,False,Denmark,2023-07-12 18:24:10,False,False,Denmark,,,,Signaloid,"['c', 'python', 'github', 'zoom']","{'other': ['github'], 'programming': ['c', 'py..."
8,Senior Data Engineer,"Sr. Big Data Engineer with Data Bricks, Loc - ...","Quincy, MA",via ZipRecruiter,Full-time,False,"California, United States",2023-05-25 18:05:51,True,False,United States,,,,Apptad Inc,"['sql', 'python', 'aws', 'databricks', 'spark'...","{'cloud': ['aws', 'databricks'], 'libraries': ..."
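
A quick way to check how many rows a positional slice returns, sketched on a hypothetical 20-row frame (`toy` is not the course data):

```python
import pandas as pd

# Hypothetical frame with 20 rows at positions 0..19
toy = pd.DataFrame({"value": range(100, 120)})

first_nine = toy.iloc[:9]    # positions 0 through 8
first_ten = toy.iloc[:10]    # positions 0 through 9

print(len(first_nine), len(first_ten))

# For a leading slice, iloc[:n] selects the same rows as head(n)
print(first_nine.equals(toy.head(9)))
```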


5. Get the first five columns of the DataFrame and all of the rows.

In [8]:
df.iloc[:, :5]

Unnamed: 0,job_title_short,job_title,job_location,job_via,job_schedule_type
0,Data Engineer,DevOps Engineer - Big Data/Advanced Analytics,"Bari, Metropolitan City of Bari, Italy",via LinkedIn,Full-time
1,Data Engineer,"Co-Op/Intern Software Engineer, Data Ingestion","Vancouver, BC, Canada",via LinkedIn,Full-time
2,Data Engineer,Hybrid - Data Engineer,"New York, NY",via LinkedIn,Full-time
3,Data Scientist,Data Scientist,"Hamtramck, MI",via BeBee,Full-time
4,Data Engineer,Data Lead Engineer (with strong Python) - Remo...,Anywhere,via Jobgether,Full-time
...,...,...,...,...,...
787681,Data Engineer,Data Engineer (f/m/d),"Heidelberg, Jerman",melalui Top County Careers,Pekerjaan tetap
787682,Business Analyst,PreSales Engineer,"Almaty, Kazakhstan",melalui Melga,Pekerjaan tetap
787683,Senior Data Engineer,Senior Data Engineer,"Berlin, Jerman",melalui Top County Careers,Pekerjaan tetap
787684,Business Analyst,Senior Sales Engineer,"Jenewa, Swiss",melalui BeBee Schweiz,Pekerjaan tetap


## loc

### Notes

* `df.loc[]`: Select rows and columns by label.
* Similar to `df.iloc[]` except we use labels instead of positions.
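
The difference between labels and positions is easiest to see when the row labels are not 0, 1, 2, and so on. Here is a minimal sketch with a hypothetical frame whose index is made of strings (the labels and values are invented for illustration):

```python
import pandas as pd

toy = pd.DataFrame(
    {
        "job_title_short": ["Data Engineer", "Data Scientist", "Data Analyst"],
        "salary_year_avg": [120000.0, 135000.0, 90000.0],
    },
    index=["job_a", "job_b", "job_c"],  # hypothetical row labels
)

# Label-based: row labelled "job_b", column labelled "salary_year_avg"
by_label = toy.loc["job_b", "salary_year_avg"]

# Position-based: second row, second column
by_position = toy.iloc[1, 1]

print(by_label, by_position)
```

With the default integer index used by `df`, row labels and row positions happen to be the same numbers, which is why `loc` and `iloc` can look interchangeable for rows.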

### Example

Let's repeat the kind of selections we did before, but with `loc`, which uses row and column labels:

1. Get the first row (index 0)
2. Get the first 10 rows of `job_skills` and `job_type_skills`.
3. Get the first 6 columns and rows 10-20.
4. Get the first 12 rows of the DataFrame.
5. Get the first five columns of the DataFrame and all of the rows.

1. Get the first row. The call looks the same as with `iloc` because the DataFrame uses the default integer index, so the label of the first row is also `0`.


In [19]:
df.loc[0]

job_title_short                                              Data Engineer
job_title                    DevOps Engineer - Big Data/Advanced Analytics
job_location                        Bari, Metropolitan City of Bari, Italy
job_via                                                       via LinkedIn
job_schedule_type                                                Full-time
job_work_from_home                                                   False
search_location                                                      Italy
job_posted_date                                        2023-10-24 18:16:15
job_no_degree_mention                                                 True
job_health_insurance                                                 False
job_country                                                          Italy
salary_rate                                                            NaN
salary_year_avg                                                        NaN
salary_hour_avg          

2. Get the first 10 rows of `job_skills` and `job_type_skills`

In [27]:
df.loc[:9,['job_skills','job_type_skills']]


Unnamed: 0,job_skills,job_type_skills
0,"['shell', 'python', 'bash', 'mongodb', 'mongod...","{'cloud': ['gcp', 'aws', 'azure'], 'databases'..."
1,"['java', 'c#', 'mysql', 'oracle', 'docker', 'k...","{'cloud': ['oracle'], 'databases': ['mysql'], ..."
2,['python'],{'programming': ['python']}
3,"['sql', 'r', 'scala', 'java', 'gcp', 'aws', 'h...","{'analyst_tools': ['tableau', 'sap', 'word'], ..."
4,"['python', 'aws', 'tableau', 'looker', 'terraf...","{'analyst_tools': ['tableau', 'looker'], 'asyn..."
5,"['python', 'nosql', 'sql', 'aws', 'redshift', ...","{'cloud': ['aws', 'redshift'], 'other': ['flow..."
6,"['sql', 'python', 'java', 'scala', 'aws', 'spa...","{'cloud': ['aws'], 'libraries': ['spark'], 'ot..."
7,"['c', 'python', 'github', 'zoom']","{'other': ['github'], 'programming': ['c', 'py..."
8,"['sql', 'python', 'aws', 'databricks', 'spark'...","{'cloud': ['aws', 'databricks'], 'libraries': ..."
9,"['sql', 'azure', 'spark']","{'cloud': ['azure'], 'libraries': ['spark'], '..."
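
`df.loc[:9]` returns ten rows because a label slice includes its end label, whereas `df.iloc[:9]` stops before position 9. A sketch on a hypothetical frame with the default integer index (`toy` and its values are stand-ins):

```python
import pandas as pd

# Hypothetical 15-row frame with the default integer index
toy = pd.DataFrame({"job_skills": [f"skill_{i}" for i in range(15)]})

label_slice = toy.loc[:9]      # labels 0 through 9, end included
position_slice = toy.iloc[:9]  # positions 0 through 8, end excluded

print(len(label_slice), len(position_slice))
print(label_slice.index[-1], position_slice.index[-1])
```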


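The same result can be obtained by slicing the rows first and then selecting the columns, although the single `loc` call above is usually preferable:

In [30]: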
df.loc[:9][['job_skills','job_type_skills']]

Unnamed: 0,job_skills,job_type_skills
0,"['shell', 'python', 'bash', 'mongodb', 'mongod...","{'cloud': ['gcp', 'aws', 'azure'], 'databases'..."
1,"['java', 'c#', 'mysql', 'oracle', 'docker', 'k...","{'cloud': ['oracle'], 'databases': ['mysql'], ..."
2,['python'],{'programming': ['python']}
3,"['sql', 'r', 'scala', 'java', 'gcp', 'aws', 'h...","{'analyst_tools': ['tableau', 'sap', 'word'], ..."
4,"['python', 'aws', 'tableau', 'looker', 'terraf...","{'analyst_tools': ['tableau', 'looker'], 'asyn..."
5,"['python', 'nosql', 'sql', 'aws', 'redshift', ...","{'cloud': ['aws', 'redshift'], 'other': ['flow..."
6,"['sql', 'python', 'java', 'scala', 'aws', 'spa...","{'cloud': ['aws'], 'libraries': ['spark'], 'ot..."
7,"['c', 'python', 'github', 'zoom']","{'other': ['github'], 'programming': ['c', 'py..."
8,"['sql', 'python', 'aws', 'databricks', 'spark'...","{'cloud': ['aws', 'databricks'], 'libraries': ..."
9,"['sql', 'azure', 'spark']","{'cloud': ['azure'], 'libraries': ['spark'], '..."


3. Get the first 6 columns (`job_title_short` through `job_work_from_home`) and rows 10-20. Label slices in `loc` include both endpoints.

In [29]:
df.loc[10:20,'job_title_short':'job_work_from_home']

Unnamed: 0,job_title_short,job_title,job_location,job_via,job_schedule_type,job_work_from_home
10,Data Scientist,"Data Scientist, Senior (Experimentation)",Anywhere,via Jobgether,Full-time,True
11,Data Analyst,Data Architect,"Toronto, ON, Canada",via LinkedIn,Full-time,False
12,Data Engineer,Informatiker als Data Engineer / Data Platform...,"Munich, Germany",via My ArkLaMiss Jobs,Full-time,False
13,Data Scientist,Gartner - Director - Data Science,"Gurugram, Haryana, India",via BeBee India,Full-time,False
14,Senior Data Scientist,Senior Data Scientist with Security Clearance,"Springfield, MO",via Trabajo.org,Full-time,False
15,Senior Data Scientist,Senior Data Scientist Model Performance,"Minneapolis, MN",via WJTV Jobs,Full-time,False
16,Data Scientist,Data Scientist,Anywhere,via LinkedIn,Full-time,True
17,Data Engineer,GCP Data Engineer,Anywhere,via LinkedIn,Contractor,True
18,Senior Data Engineer,"Senior Engineer, Data Insights (Remote)",Anywhere,via ZipRecruiter,Full-time,True
19,Data Engineer,Data Engineer- Sr Jobs,"Washington, DC",via Clearance Jobs,Full-time,False
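
Label slices include both endpoints for rows and for columns, so the equivalent positional selection has to go one past each end. A sketch assuming the default integer index (`toy` and its column names are hypothetical):

```python
import pandas as pd

# Hypothetical 30-row frame with four columns
toy = pd.DataFrame({
    "a": range(30),
    "b": range(100, 130),
    "c": range(200, 230),
    "d": range(300, 330),
})

# Labels: rows 10 through 20 and columns "a" through "c", both ends included
by_label = toy.loc[10:20, "a":"c"]

# Positions: each slice end is excluded, so go one past it
by_position = toy.iloc[10:21, 0:3]

print(by_label.shape)
print(by_label.equals(by_position))
```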
