## Week 4 - Pandas Basics

### Outline:

1. Introduction to Pandas\
    a. Importing Libraries\
    b. Loading Data
    
2. Exploring Your Data

3. Accessing Data

4. Filtering Data




## Introduction to Pandas


__Pandas__ is a powerful and popular __Python__ __library__ for data manipulation and analysis. It provides data structures and functions for working with structured data, such as spreadsheets and SQL tables, so that we can make data-driven decisions.

We need to import the pandas library into our environment in order to use it below!

We'll also import another essential library called __NumPy__, which handles numerical and mathematical operations in Python and is often used together with pandas for data analysis.



In [None]:
import pandas as pd
import numpy as np

## Loading Data

 Pandas provides a convenient way to load data from various sources, such as CSV files, Excel spreadsheets, and databases. Here, we'll focus on loading a CSV file into a DataFrame. 

A CSV file, which stands for "Comma-Separated Values," is a plain text file format used to store tabular data, such as spreadsheets or database tables. It's used widely for data interchange, as it's simple, human-readable, and easily processed by computer programs.

To load a CSV file, you can use the 'read_csv()' function as below:

In [None]:
df = pd.read_csv('sample_dataset.csv')

In [None]:
df
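
If 'sample_dataset.csv' is not at hand, the same call can be tried on CSV text held in memory. This is only a minimal sketch: the text in `csv_text` and the name `demo_df` are made up for illustration and are not the course dataset.

```python
import io

import pandas as pd

# Hypothetical CSV text standing in for a file on disk
csv_text = """Name,Age,State
Alice,28,California
Bob,35,Texas
"""

# read_csv accepts a file path or any file-like object;
# the first line of the text becomes the column names
demo_df = pd.read_csv(io.StringIO(csv_text))
print(demo_df)
```

Swapping `io.StringIO(csv_text)` for a file path such as 'sample_dataset.csv' gives the call used above.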

### DataFrames and Series

In pandas, a __DataFrame__ is a two-dimensional, table-like data structure that consists of rows and columns, similar to a spreadsheet or a table. Each column in a DataFrame is called a __Series__, which is a one-dimensional array.

In our dataset, each row represents one person, and each column holds one piece of information about that person, such as their age or state.
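
To make the DataFrame/Series relationship concrete, here is a small sketch. It builds a hypothetical three-row DataFrame by hand (the name `demo_df` and its values are assumptions for illustration) and pulls one column out of it.

```python
import pandas as pd

# Hypothetical mini-dataset: one row per person
demo_df = pd.DataFrame({
    "Name": ["Alice", "Bob", "Eve"],
    "Age": [28, 35, 26],
})

# Selecting a single column returns a Series
ages = demo_df["Age"]

print(type(demo_df))  # the whole table is a DataFrame
print(type(ages))     # one column is a Series
print(ages.name)      # the Series remembers its column name
```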

## Exploring Your Data 

After loading your dataset into a DataFrame, you might want to explore and understand the data better. Pandas provides a few tools for this:
- .shape
- .info()
- .describe()

__'.shape'__ gives you the dimensions of your dataset, __'.info()'__ provides a concise summary of your data, and __'.describe()'__ generates summary statistics for numeric columns. Together, they are essential for understanding your dataset's structure and content.

The __.shape__ attribute returns a tuple (ordered sequence of elements) representing dimensions of your data.
The first element is the number of rows and the second element is the number of columns.


In [None]:
df_shape = df.shape  # (number of rows, number of columns)
df_shape
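
Because the shape is a tuple, its two elements can be unpacked into separate variables. A minimal sketch on a hypothetical DataFrame (`demo_df` and its values are made up for illustration):

```python
import pandas as pd

# Hypothetical DataFrame with 3 rows and 2 columns
demo_df = pd.DataFrame({
    "Name": ["Alice", "Bob", "Eve"],
    "Age": [28, 35, 26],
})

# Rows come first, columns second
n_rows, n_cols = demo_df.shape
print(f"{n_rows} rows, {n_cols} columns")
```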

The __.info()__ method provides a concise summary of your DataFrame, including information about the data types of each column, the number of non-null values, and the memory usage.

It's especially useful for quickly checking data types and identifying missing values, which we want to clean out eventually before doing more with our data.

In [None]:
df.info()
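
To see how missing values show up in this summary, the sketch below uses a hypothetical DataFrame (`demo_df`, an assumption for illustration) in which one age is missing. The non-null count for 'Age' should come out one lower than the number of rows.

```python
import pandas as pd

# Hypothetical data: Bob's age is missing
demo_df = pd.DataFrame({
    "Name": ["Alice", "Bob", "Eve"],
    "Age": [28, None, 26],  # None is stored as a missing value (NaN)
})

# info() prints its summary; compare the non-null counts of the two columns
demo_df.info()
```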

The __.describe()__ method generates __summary statistics__ for all numeric columns in your DataFrame. It provides key statistics, including count, mean, standard deviation, minimum, 25th percentile, median (50th percentile), 75th percentile, and maximum.

This is useful for understanding the central tendencies and distribution of your data!


In [None]:
df_summary = df.describe()
df_summary
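
The result of describe() is itself a DataFrame, so individual statistics can be read back out of it with square brackets. A minimal sketch, assuming a small hypothetical DataFrame (`demo_df`) rather than the course dataset:

```python
import pandas as pd

# Hypothetical data with one text column and two numeric columns
demo_df = pd.DataFrame({
    "Name": ["Alice", "Bob", "Eve", "Frank"],
    "Age": [28, 35, 26, 30],
    "Job_Salary": [65000, 80000, 48000, 75000],
})

# Only the numeric columns appear in the summary by default
summary = demo_df.describe()
print(summary)

# Pick a column first, then the statistic by its row label
mean_age = summary["Age"]["mean"]  # (28 + 35 + 26 + 30) / 4 = 29.75
print(mean_age)
```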

What if you want to see the __first few rows__ or the __last few rows__ of a DataFrame? We can use the __.head()__ or __.tail()__ methods, which each return five rows by default. This is useful for a quick look at your data:

In [None]:
first_few_rows = df.head()

first_few_rows

In [None]:
last_few_rows = df.tail()
last_few_rows
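
Both methods also accept a number of rows to return. A short sketch on a hypothetical seven-row DataFrame (`demo_df` is made up for illustration):

```python
import pandas as pd

# Hypothetical DataFrame with seven rows
demo_df = pd.DataFrame({"Age": [28, 35, 26, 30, 33, 29, 29]})

default_head = demo_df.head()  # no argument: the first five rows
first_two = demo_df.head(2)    # only the first two rows
last_three = demo_df.tail(3)   # only the last three rows

print(first_two)
print(last_three)
```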

## Accessing Data

In pandas, you can access a specific column in a DataFrame by using square brackets with the column name (sound familiar?). For example, if you have a DataFrame named 'df' and you want to access the 'Job_Salary' column, you can do it like this:



In [None]:
job_salary = df['Job_Salary']

job_salary

Once we have accessed our column of interest, we can extract statistical information from it by using methods like:
- .mean()
- .std()

In [None]:
# A notebook cell only displays its last expression, so print both results
print(job_salary.mean())
print(job_salary.std())
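
One detail worth knowing: by default, .std() in pandas computes the sample standard deviation (it divides by n - 1), while passing ddof=0 gives the population version (dividing by n). The sketch below checks this by hand on a small hypothetical Series (`values` is an assumption for illustration).

```python
import pandas as pd

# Hypothetical values, small enough to check by hand
values = pd.Series([1, 2, 3, 4])

mean_value = values.mean()            # (1 + 2 + 3 + 4) / 4 = 2.5
sample_std = values.std()             # sqrt(5 / 3): squared deviations sum to 5, divided by n - 1
population_std = values.std(ddof=0)   # sqrt(5 / 4): same sum, divided by n

print(mean_value, sample_std, population_std)
```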

## Practice 1


Please load the contents of column 'State' into a variable called "sample_state".

Please load the contents of column 'Age' into a variable called "sample_age".

What is the average age of the people in our dataset?

31.083333333333332

What is the standard deviation in the job salaries that we have in our dataset?

## Filtering Data

Data filtering is a crucial part of data analysis in pandas. It involves selecting or removing rows from a DataFrame based on specific conditions or criteria, which allows you to focus on the data that is relevant to your analysis while excluding irrelevant or unwanted data.

To select rows based on a condition, pass a Boolean mask that encodes the condition to the DataFrame inside square brackets:

In [None]:
condition = df['Age'] >= 30  # an example Boolean mask: one True/False value per row
df1 = df[condition]

In this syntax:
- __df__ is the DataFrame that you are working with
- __condition__ is a Boolean expression that defines the condition you want to apply. Rows for which the condition evaluates to True are retained, and rows for which the condition evaluates to False are removed.
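
It can help to look at the mask on its own before using it. In this sketch, the DataFrame `demo_df` and the name `mask` are hypothetical, chosen only to illustrate the idea:

```python
import pandas as pd

# Hypothetical mini-dataset
demo_df = pd.DataFrame({
    "Name": ["Alice", "Bob", "Eve", "Frank"],
    "Age": [28, 35, 26, 30],
})

# Comparing a column to a value yields one True/False per row
mask = demo_df["Age"] >= 30
print(mask)

# Passing the mask in square brackets keeps only the True rows
older = demo_df[mask]
print(older)
```

Note that the filtered rows keep their original index labels, which is why filtered outputs can show gaps in the index.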

In [None]:
# FIND people whose 'Age' is greater than or equal to 30:

df_age = df[df['Age'] >= 30]
df_age

In [None]:
# FIND the people who live in Texas

df_state = df[df['State'] == 'Texas'] 
df_state

In [None]:
# Another way to write the filter above; note that the output is the same:

df_state1 = df[df.State == 'Texas']
df_state1
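
A caveat on the dot notation: it only works when the column name is a valid Python identifier and does not clash with an existing DataFrame attribute or method, whereas square brackets always work. The sketch below uses a hypothetical DataFrame (`demo_df`) with a made-up column name containing a space to show the difference.

```python
import pandas as pd

# Hypothetical data; 'Job Title' (with a space) is a made-up column name
demo_df = pd.DataFrame({
    "State": ["California", "Texas", "Texas"],
    "Job Title": ["Data Analyst", "Software Engineer", "Data Scientist"],
})

# Both notations select the same column when the name is a valid identifier
same_column = demo_df.State.equals(demo_df["State"])
print(same_column)

# A name with a space cannot be written after a dot, so use square brackets
titles = demo_df["Job Title"]
texans = demo_df[demo_df["State"] == "Texas"]
print(texans)
```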

## Compound Filtering Using | (OR) and & (AND) Operators:

In more complex filtering scenarios, you might need to combine multiple conditions using logical operators. You can use the | (OR) and & (AND) operators for compound filtering.

In [None]:
# Example 3: Select rows where the 'Age' is either less than 18 or the 'State' is 'California'
filtered_data = df[(df['Age'] < 18) | (df['State'] == 'California')]
filtered_data

In this example, we create a DataFrame 'filtered_data' by selecting rows where either the 'Age' is less than 18 or the 'State' is 'California'. The __|__ operator is used for the __OR condition__.

But...you can also use the __&__ operator for __AND conditions__:

In [None]:
# Example 4: Select rows where the 'Age' is greater than 25 and the 'Job_Salary' is above 60000
filtered_data2 = df[(df['Age'] > 25) & (df['Job_Salary'] > 60000)]
filtered_data2

In this case, 'filtered_data2' includes only the rows where both conditions are met: 'Age' is greater than 25 and 'Job_Salary' is above 60000.
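
The parentheses around each condition are required: in Python, | and & bind more tightly than comparisons such as > and ==, so leaving them out changes what gets compared. The sketch below runs both operators side by side on a hypothetical four-row DataFrame (`demo_df`, `either`, and `both` are illustrative names, not the course dataset).

```python
import pandas as pd

# Hypothetical mini-dataset
demo_df = pd.DataFrame({
    "Name": ["Alice", "Bob", "Eve", "Frank"],
    "Age": [28, 35, 26, 30],
    "State": ["California", "Texas", "California", "Texas"],
    "Job_Salary": [65000, 80000, 48000, 75000],
})

# OR: a row is kept if at least one condition is True
either = demo_df[(demo_df["Age"] > 32) | (demo_df["State"] == "California")]

# AND: a row is kept only if both conditions are True
both = demo_df[(demo_df["Age"] > 25) & (demo_df["Job_Salary"] > 60000)]

print(either)
print(both)
```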

## Practice 2

Select rows where the Age is less than 15.

Unnamed: 0,Name,Age,State,Job_Title,Job_Salary


Select rows where the 'Age' is greater than 20 or the 'Job_Salary' is lower than 85000.

Unnamed: 0,Name,Age,State,Job_Title,Job_Salary
0,Alice,28,California,Data Analyst,65000
1,Bob,35,Texas,Software Engineer,80000
4,Eve,26,California,Graphic Designer,48000
5,Frank,30,Texas,Data Scientist,75000
6,Grace,33,New York,Web Developer,70000
7,Hannah,29,Florida,Financial Analyst,60000
10,Liam,29,Texas,Software Developer,60000
11,Sophia,32,Florida,UI/UX Designer,55000
12,Oliver,26,New York,Accountant,65000
13,Ava,27,Texas,HR Manager,75000


Select rows where the 'Age' is greater than 20 and less than 30.

Unnamed: 0,Name,Age,State,Job_Title,Job_Salary
0,Alice,28,California,Data Analyst,65000
4,Eve,26,California,Graphic Designer,48000
7,Hannah,29,Florida,Financial Analyst,60000
10,Liam,29,Texas,Software Developer,60000
12,Oliver,26,New York,Accountant,65000
13,Ava,27,Texas,HR Manager,75000
