<a target="_blank" href="https://colab.research.google.com/github/lukebarousse/Python_Data_Analytics_Course/blob/main/1_Basics/23_Pandas_Intro.ipynb">
  <img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/>
</a>

# Pandas Intro

## Overview

### Notes

* **Pandas** is a Python library used for working with data sets.
* It lets us analyze, clean, explore and manipulate data.
* Pandas primarily uses two data structures to store data:
    * **Series**: A one-dimensional array with data labels, called its index, capable of holding any data type. It's like a column in a spreadsheet.
    * **DataFrame**: A two-dimensional, mutable table with labeled axes (rows and columns). It resembles a spreadsheet or SQL table and can contain multiple Series objects of different data types.
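
As a minimal sketch of the two structures described above, the snippet below builds one Series and one DataFrame by hand. The names `salaries` and `jobs` and all values are made up for illustration and are not part of the course dataset.

```python
import pandas as pd

# A Series: one-dimensional values plus an index of labels (illustrative values).
salaries = pd.Series(
    [90000, 120000, 105000],
    index=["analyst", "scientist", "engineer"],
    name="salary",
)

# A DataFrame: a table whose columns are Series that share one index.
jobs = pd.DataFrame(
    {
        "title": ["Data Analyst", "Data Scientist", "Data Engineer"],
        "remote": [False, True, False],
        "salary": [90000.0, 120000.0, 105000.0],
    }
)

print(salaries["scientist"])  # look up a value by its label
print(type(jobs["title"]))    # selecting one column returns a Series
print(jobs.dtypes)            # each column keeps its own dtype
```

Selecting a single column from `jobs` gives back a Series, which is why the two structures are usually learned together.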

### Importance

* Pandas is one of the most popular and widely used Python libraries for working with data.
* It provides functions for data manipulation, from simple data aggregation to complex merging and joining of datasets.
* It lets us analyze large datasets and compute summary statistics on them.
* Works well with other Python libraries, enhancing its functionality for numerical computations and visualizations.

### Import

* Before we can use the library, we have to load it into our environment with the `import` statement.
* When we import it, we rename it to `pd`, the conventional alias for pandas. The alias is technically optional, but it saves typing `pandas` every time and is what most pandas code uses.

NOTE: If running this locally with conda you need to install pandas (ignore if running in Colab).

In [5]:
# !conda install pandas 

import pandas as pd

## Loading Data

### Notes

- We can load data from a CSV using `pd.read_csv()`
- We can also load data from an Excel file using `pd.read_excel()`

In [3]:
# pd.read_csv('path_to_file.csv')

In [4]:
# pd.read_excel('path_to_file.xlsx')
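
The paths in the cells above are placeholders. To try `pd.read_csv()` without a file on disk, here is a small sketch that reads CSV text from memory. The column names are borrowed from the course dataset, but the rows (including the salary figure) are made up, and `csv_text` and `sample` are names chosen for this example.

```python
import io

import pandas as pd

# Made-up CSV content, held in memory so the example needs no file on disk.
csv_text = """job_title_short,job_country,salary_year_avg
Data Analyst,Mexico,
Data Scientist,Portugal,
Data Engineer,United States,125000
"""

# read_csv accepts a path or a file-like object; StringIO stands in for a file here.
sample = pd.read_csv(io.StringIO(csv_text))

print(sample.shape)                      # (rows, columns)
print(sample["salary_year_avg"].isna())  # empty fields are read as NaN
```

With a real file, the `io.StringIO(...)` argument would simply be replaced by the path string.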

### Example

For the course dataset we use the Hugging Face `datasets` library. This lets us run the code in Colab notebooks without any trouble, but you will have to install and load the `datasets` library every time you open a new session.

NOTE: Whether you run this in Colab or locally, you need to install the `datasets` library first.

In [6]:
# !conda install datasets

# OR

# !pip install datasets

To load the latest version of the dataset, use this code.

In [1]:
from datasets import load_dataset

# Loading Data
dataset = load_dataset('lukebarousse/data_jobs')
df = dataset['train'].to_pandas()


**Note:** For the rest of the sections in this course we'll be doing some exploratory data analysis (EDA) to learn more about the dataset we have. This is a crucial first step so we understand the dataset before doing more advanced analysis and visualization.

## DataFrames Intro

#### What is it?

- A DataFrame is a 2-dimensional labeled data structure with columns of potentially different types.
- It's like a table in a relational database or a spreadsheet.

#### Rows and Columns
- DataFrames consist of rows and columns, where each row represents a single observation or record, and each column represents a variable or feature.

In [2]:
df

Unnamed: 0,job_title_short,job_title,job_location,job_via,job_schedule_type,job_work_from_home,search_location,job_posted_date,job_no_degree_mention,job_health_insurance,job_country,salary_rate,salary_year_avg,salary_hour_avg,company_name,job_skills,job_type_skills
0,Data Analyst,Data Analytics,"Monterrey, Nuevo Leon, Mexico",via BeBee,Part-time,False,Mexico,2023-12-05 07:26:05,False,False,Mexico,,,,2U Bootcamps Instructional Engagement,"['go', 'python', 'mongodb', 'mongodb', 'css', ...","{'analyst_tools': ['tableau'], 'databases': ['..."
1,Data Scientist,Data Scientist Intern,"Lisbon, Portugal",via Empregos Trabajo.org,Full-time,False,Portugal,2023-08-20 07:46:51,False,False,Portugal,,,,Nokia,"['sql', 'python', 'sql server', 'oracle', 'azu...","{'analyst_tools': ['sap'], 'cloud': ['oracle',..."
2,Data Analyst,"Manager, Data Analytics","Guanacaste Province, Lagunilla, Costa Rica",via BeBee Costa Rica,Full-time,False,Costa Rica,2023-11-21 08:37:14,False,False,Costa Rica,,,,Thermo Fisher Scientific,,
3,Data Engineer,Data Engineer,"Lambayeque, Peru",via BeBee Perú,Full-time,False,Peru,2023-11-21 07:36:33,True,False,Peru,,,,Emprego,,
4,Data Analyst,Technical Data Analyst,"Fairfax, VA",via Indeed,Contractor,False,"New York, United States",2023-12-20 07:00:10,True,False,United States,,,,Info Origin Inc.,"['sql', 'python', 'jira']","{'async': ['jira'], 'programming': ['sql', 'py..."
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
787681,Senior Data Engineer,(Senior) Data Engineer (m/w/d) - PUBLIC BW / C...,"Leinfelden-Echterdingen, Jerman",melalui Monster.de,Pekerjaan tetap,False,Germany,2023-03-13 06:18:59,False,False,Germany,,,,"CGI Group, Inc.","['python', 'sql', 'azure', 'aws', 'hadoop', 's...","{'cloud': ['azure', 'aws'], 'libraries': ['had..."
787682,Data Analyst,Lead Data Analyst,Jerman,melalui BeBee Deutschland,Pekerjaan tetap,False,Germany,2023-03-12 06:18:18,False,False,Germany,,,,Amtrak,"['vba', 'sql', 'python', 'excel', 'sap', 'shar...","{'analyst_tools': ['excel', 'sap', 'sharepoint..."
787683,Data Engineer,Lead Data Engineer,"Frankfurt am Main, Jerman",melalui Top County Careers,Pekerjaan tetap,False,Germany,2023-03-13 06:18:59,False,False,Germany,,,,Tiro Partners Limited,"['python', 'sql', 'scala', 'java', 'javascript...","{'analyst_tools': ['sas', 'tableau', 'power bi..."
787684,Software Engineer,Lead Solutions Design Engineer,"San Juan, Puerto Riko",melalui BeBee Puerto Rico,Pekerjaan tetap,False,Puerto Rico,2023-03-12 06:31:19,False,False,Puerto Rico,,,,Ryder,"['excel', 'powerpoint', 'tableau']","{'analyst_tools': ['excel', 'powerpoint', 'tab..."
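
To check the row and column structure without scrolling through the full display, a sketch like the following can help. It uses a small hypothetical frame named `toy` (made-up values, column names borrowed from the dataset) so it runs on its own; the same attributes apply to `df`.

```python
import pandas as pd

# Toy stand-in for the job postings table (values are made up).
toy = pd.DataFrame(
    {
        "job_title_short": ["Data Analyst", "Data Scientist", "Data Engineer"],
        "job_country": ["Mexico", "Portugal", "Peru"],
        "job_work_from_home": [False, False, True],
    }
)

n_rows, n_cols = toy.shape  # (number of rows, number of columns)
print(n_rows, n_cols)
print(list(toy.columns))    # column labels
print(toy.head(2))          # first two rows only
```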


#### Column Names

- Column names provide labels for each column in the DataFrame
- They allow for easy reference and manipulation of data

In [3]:
df['job_title_short']

0                 Data Analyst
1               Data Scientist
2                 Data Analyst
3                Data Engineer
4                 Data Analyst
                  ...         
787681    Senior Data Engineer
787682            Data Analyst
787683           Data Engineer
787684       Software Engineer
787685    Senior Data Engineer
Name: job_title_short, Length: 787686, dtype: object

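Note: the footer of the output shows the column name, the length, and the dtype.

You can also access a column with dot notation, as long as its name is a valid Python identifier and does not clash with an existing DataFrame attribute or method.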

In [5]:
df.job_title_short

0                 Data Analyst
1               Data Scientist
2                 Data Analyst
3                Data Engineer
4                 Data Analyst
                  ...         
787681    Senior Data Engineer
787682            Data Analyst
787683           Data Engineer
787684       Software Engineer
787685    Senior Data Engineer
Name: job_title_short, Length: 787686, dtype: object
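
The two notations are interchangeable for simple names, but bracket notation is the more general one. The sketch below shows this on a hypothetical frame, `toy`, whose column names were chosen only to show the edge cases.

```python
import pandas as pd

# Toy frame with hypothetical column names chosen to show the edge cases.
toy = pd.DataFrame(
    {
        "job_title_short": ["Data Analyst", "Data Engineer"],
        "salary year avg": [90000.0, 110000.0],  # name contains spaces
        "count": [3, 5],                         # name clashes with DataFrame.count()
    }
)

# Both notations return the same Series for a simple column name.
print(toy.job_title_short.equals(toy["job_title_short"]))

# Bracket notation works for any column name.
print(toy["salary year avg"].tolist())

# Here dot notation resolves to the DataFrame method, not the column.
print(callable(toy.count))

# A list of names inside the brackets selects several columns as a DataFrame.
print(toy[["job_title_short", "count"]].shape)
```

For this reason bracket notation is the safer default in scripts, while dot notation is mostly a convenience for interactive work.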

#### Index
- DataFrames have an index, which provides a label for each row. By default, it's a sequence of integers starting from 0, but it can be customized.

In [11]:
# Access a single value in a column by its index label
df.job_title_short[787685]

'Senior Data Engineer'
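
The index can also be customized, as noted above. As a rough sketch on a hypothetical frame `toy` with made-up string labels, `.loc` selects by index label and `.iloc` selects by integer position:

```python
import pandas as pd

toy = pd.DataFrame(
    {"job_title_short": ["Data Analyst", "Data Scientist", "Data Engineer"]},
    index=["a", "b", "c"],  # custom labels instead of the default 0, 1, 2
)

by_label = toy.loc["b", "job_title_short"]  # select by index label
by_position = toy.iloc[-1, 0]               # select by position: last row, first column
whole_row = toy.loc["a"]                    # a full row comes back as a Series

print(by_label)
print(by_position)
print(whole_row)
```

With the default integer index the label and the position happen to coincide, which is why the distinction only becomes visible once the index is customized.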

#### Data Inspection (upcoming)

DataFrames have many methods for inspecting them. For example, `info()` summarizes the columns, their non-null counts, and their dtypes.

In [12]:
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 787686 entries, 0 to 787685
Data columns (total 17 columns):
 #   Column                 Non-Null Count   Dtype  
---  ------                 --------------   -----  
 0   job_title_short        787686 non-null  object 
 1   job_title              787685 non-null  object 
 2   job_location           786646 non-null  object 
 3   job_via                787679 non-null  object 
 4   job_schedule_type      774976 non-null  object 
 5   job_work_from_home     787686 non-null  bool   
 6   search_location        787686 non-null  object 
 7   job_posted_date        787686 non-null  object 
 8   job_no_degree_mention  787686 non-null  bool   
 9   job_health_insurance   787686 non-null  bool   
 10  job_country            787633 non-null  object 
 11  salary_rate            33073 non-null   object 
 12  salary_year_avg        22026 non-null   float64
 13  salary_hour_avg        10649 non-null   float64
 14  company_name           787668 non-nu