# Week 10: Introduction to Pandas – Data Manipulation and Analysis

In this demo, we will explore the basics of pandas, a powerful Python library designed for data manipulation and analysis.  
We will walk through creating DataFrames and then selecting, cleaning, and summarizing data, the same steps you will apply to real-world datasets.

---

# 1. Import pandas

Let's start by importing pandas.

In [1]:
import pandas as pd

*Try it:*  
*In your notebook, run the import.*

---

# 2. Creating DataFrames

Unlike NumPy arrays, which are addressed by integer position, a pandas DataFrame is a tabular data structure whose rows and columns carry labels.
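
To make the contrast concrete, here is a minimal sketch that stores the same values in a NumPy array and in a DataFrame. The names `arr` and `people` and the row labels are made up for this illustration.

```python
import numpy as np
import pandas as pd

# A plain NumPy array: values are addressed by integer position only.
arr = np.array([[25, 50000], [30, 60000]])
first_age_from_array = arr[0, 0]

# The same values in a DataFrame: rows and columns carry labels.
people = pd.DataFrame(arr, index=['Alice', 'Bob'], columns=['Age', 'Salary'])
first_age_from_frame = people.loc['Alice', 'Age']

print(first_age_from_array, first_age_from_frame)
```

Both lookups return the same value; the DataFrame version simply says what the value means.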

### a) From a dictionary

In [2]:
data = {
    'Name': ['Alice', 'Bob', 'Charlie', 'David'],
    'Age': [25, 30, 35, 40],
    'Salary': [50000, 60000, 55000, 65000],
}
df = pd.DataFrame(data)
df

| | Name | Age | Salary |
|---|---|---|---|
| 0 | Alice | 25 | 50000 |
| 1 | Bob | 30 | 60000 |
| 2 | Charlie | 35 | 55000 |
| 3 | David | 40 | 65000 |


*Prompt:*  
*Try creating your own small DataFrame from a dictionary with different columns.*

---

### b) From a CSV file

In [3]:
# Uncomment and run if you have a data file:
# df = pd.read_csv('your_data.csv')

(*Note: pandas itself does not ship sample datasets. For practice, use any CSV file you have, or a public dataset you have downloaded.*)
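
If you have no CSV file at hand, one way to try `pd.read_csv` is to read from an in-memory string. This is only a sketch: the CSV text below is made up for illustration, and `io.StringIO` stands in for a real file path.

```python
import io
import pandas as pd

# Hypothetical CSV content; a real file on disk would look the same.
csv_text = """Name,Age,Salary
Alice,25,50000
Bob,30,60000
"""

# read_csv accepts a file path or any file-like object.
df_from_csv = pd.read_csv(io.StringIO(csv_text))
print(df_from_csv)
```

With a real file, pass its path instead, as in the commented line above.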

---

# 3. Viewing Data

### a) Preview the data

In [4]:
df.head()  # first 5 rows (this DataFrame has only 4)

| | Name | Age | Salary |
|---|---|---|---|
| 0 | Alice | 25 | 50000 |
| 1 | Bob | 30 | 60000 |
| 2 | Charlie | 35 | 55000 |
| 3 | David | 40 | 65000 |


### b) Get info about data types and missing values

In [5]:
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 4 entries, 0 to 3
Data columns (total 3 columns):
 #   Column  Non-Null Count  Dtype 
---  ------  --------------  ----- 
 0   Name    4 non-null      object
 1   Age     4 non-null      int64 
 2   Salary  4 non-null      int64 
dtypes: int64(2), object(1)
memory usage: 228.0+ bytes


### c) Summary statistics

In [6]:
df.describe()  # numeric columns

| | Age | Salary |
|---|---|---|
| count | 4.0 | 4.0 |
| mean | 32.5 | 57500.0 |
| std | 6.454972 | 6454.972244 |
| min | 25.0 | 50000.0 |
| 25% | 28.75 | 53750.0 |
| 50% | 32.5 | 57500.0 |
| 75% | 36.25 | 61250.0 |
| max | 40.0 | 65000.0 |


*Try:*  
*Use your own dataset or the sample above; explore the info and describe methods.*
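
The sample DataFrame has no missing values, so the `Non-Null Count` column is not very informative there. As a small sketch, the made-up frame `sample` below contains one missing age, and `isna().sum()` counts the missing values per column.

```python
import numpy as np
import pandas as pd

# Illustrative data with one missing Age value.
sample = pd.DataFrame({
    'Name': ['Alice', 'Bob', 'Charlie'],
    'Age': [25, np.nan, 35],
})

# isna() marks missing cells as True; sum() counts them per column.
missing_per_column = sample.isna().sum()

print(sample.dtypes)
print(missing_per_column)
```

Note that the missing value turns `Age` into a floating-point column, because `NaN` is a float.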

---

# 4. Selecting Data

### a) Select a column

In [7]:
df['Age']

0    25
1    30
2    35
3    40
Name: Age, dtype: int64

### b) Select multiple columns

In [8]:
df[['Name', 'Salary']]

| | Name | Salary |
|---|---|---|
| 0 | Alice | 50000 |
| 1 | Bob | 60000 |
| 2 | Charlie | 55000 |
| 3 | David | 65000 |


### c) Select rows by position

In [9]:
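df.iloc[0]  # first row, returned as a Series (not displayed: a cell shows only its last expression)
df.iloc[0:2]  # first two rows (positions 0 and 1; the end position is excluded)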

| | Name | Age | Salary |
|---|---|---|---|
| 0 | Alice | 25 | 50000 |
| 1 | Bob | 30 | 60000 |

### d) Select rows by label

In [10]:
df.loc[0]  # row with index label 0, returned as a Series
df.loc[0:2]  # rows with labels 0 to 2 (label slices include the end label)

| | Name | Age | Salary |
|---|---|---|---|
| 0 | Alice | 25 | 50000 |
| 1 | Bob | 30 | 60000 |
| 2 | Charlie | 35 | 55000 |


*Prompt:*  
*Try selecting a different row or column in your dataset.*
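
With the default integer index, `iloc` and `loc` look almost identical. The difference is easier to see with string labels. The sketch below uses a made-up frame `staff` whose index holds names.

```python
import pandas as pd

# Illustrative DataFrame with string labels as the index.
staff = pd.DataFrame(
    {'Age': [25, 30, 35]},
    index=['alice', 'bob', 'charlie'],
)

# iloc works on positions; the end position is excluded.
by_position = staff.iloc[0:2]

# loc works on labels; the end label is included.
by_label = staff.loc['alice':'bob']

# loc can also pick a single cell by row label and column label.
charlie_age = staff.loc['charlie', 'Age']

print(by_position)
print(by_label)
print(charlie_age)
```

Here both slices return the same two rows, but for different reasons: one counts positions, the other matches labels.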

---

# 5. Filtering Data

Select the rows that satisfy a condition by indexing the DataFrame with a boolean Series:

In [11]:
# Age greater than 30
df[df['Age'] > 30]

| | Name | Age | Salary |
|---|---|---|---|
| 2 | Charlie | 35 | 55000 |
| 3 | David | 40 | 65000 |


Combine multiple conditions with `&` (and) or `|` (or), wrapping each condition in parentheses:

In [12]:
# Age > 30 and Salary > 55000
df[(df['Age'] > 30) & (df['Salary'] > 55000)]

| | Name | Age | Salary |
|---|---|---|---|
| 3 | David | 40 | 65000 |


*Try:*  
*Create your own filtering based on different conditions.*
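
A few other ways to build conditions, shown as a sketch on a copy of the sample data (named `df_demo` here so it does not overwrite `df`): `|` for "or", `~` for "not", and `isin` for membership in a list.

```python
import pandas as pd

# Same values as the sample DataFrame used in this demo.
df_demo = pd.DataFrame({
    'Name': ['Alice', 'Bob', 'Charlie', 'David'],
    'Age': [25, 30, 35, 40],
    'Salary': [50000, 60000, 55000, 65000],
})

# OR: younger than 30, or earning more than 60000.
young_or_high = df_demo[(df_demo['Age'] < 30) | (df_demo['Salary'] > 60000)]

# NOT: everyone who is not older than 30.
not_over_30 = df_demo[~(df_demo['Age'] > 30)]

# Membership: rows whose Name appears in a list.
selected = df_demo[df_demo['Name'].isin(['Bob', 'David'])]

print(young_or_high)
print(not_over_30)
print(selected)
```

The parentheses matter: `&`, `|`, and `~` bind more tightly than comparison operators in Python.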

---

# 6. Sorting Data

Sort by a column:

In [13]:
# Sort by Salary descending
df.sort_values('Salary', ascending=False)

| | Name | Age | Salary |
|---|---|---|---|
| 3 | David | 40 | 65000 |
| 1 | Bob | 30 | 60000 |
| 2 | Charlie | 35 | 55000 |
| 0 | Alice | 25 | 50000 |
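
`sort_values` also accepts a list of columns, with one sort direction per column. The sketch below uses a made-up frame `team`; it sorts departments alphabetically and, within each department, salaries from highest to lowest.

```python
import pandas as pd

# Illustrative data with repeated departments.
team = pd.DataFrame({
    'Department': ['Sales', 'HR', 'Sales', 'HR'],
    'Salary': [60000, 52000, 61000, 50000],
})

# Sort by Department ascending, then by Salary descending.
ordered = team.sort_values(['Department', 'Salary'], ascending=[True, False])

# reset_index(drop=True) renumbers the rows after sorting.
ordered = ordered.reset_index(drop=True)
print(ordered)
```

Sorting returns a new DataFrame; `team` itself keeps its original order.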


---

# 7. Cleaning Data

### a) Handling missing data

In [14]:
# Fill missing values in one column (assign the result back)
df['Salary'] = df['Salary'].fillna(0)

# Drop rows that still contain missing data
df = df.dropna()

(*Note: avoid `df['Salary'].fillna(0, inplace=True)`. Recent pandas versions emit a FutureWarning for it, because the selected column behaves as a copy and, from pandas 3.0 on, the original DataFrame will not be updated. Use `df[col] = df[col].method(value)` or `df.method({col: value}, inplace=True)` instead. The sample DataFrame has no missing values, so both calls above leave it unchanged.*)
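
Because the sample DataFrame has no missing values, the effect of these calls is easier to see on data that does. The frame `raw` below is made up for this sketch.

```python
import numpy as np
import pandas as pd

# Illustrative data: one missing Salary and one missing Name.
raw = pd.DataFrame({
    'Name': ['Alice', 'Bob', None],
    'Salary': [50000, np.nan, 55000],
})

# Work on a copy so the raw data stays available for comparison.
filled = raw.copy()

# Replace missing salaries with 0 by assigning the result back.
filled['Salary'] = filled['Salary'].fillna(0)

# Drop any row that still has a missing value (the row without a Name).
cleaned = filled.dropna()

print(raw.isna().sum())
print(cleaned)
```

Whether 0 is a sensible replacement depends on the data; it is used here only to mirror the cell above.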


### b) Adding new columns

In [15]:
# For example, add a column for salary after tax (assume 20% tax)
df['Salary_after_tax'] = df['Salary'] * 0.8

*Prompt:*  
*Try creating a new column based on existing ones.*
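
A new column can also combine several existing columns. As a sketch, the made-up frame `pay` below has a hypothetical `Bonus` column, and `assign` returns a new DataFrame with the extra column instead of modifying the original.

```python
import pandas as pd

# Illustrative data; the Bonus column is hypothetical.
pay = pd.DataFrame({
    'Salary': [50000, 60000],
    'Bonus': [5000, 3000],
})

# assign() returns a copy with the new column; pay itself is unchanged.
pay_with_total = pay.assign(Total=pay['Salary'] + pay['Bonus'])
print(pay_with_total)
```

Use `df['Total'] = ...` when you want to change the DataFrame in place, and `assign` when you want to keep the original untouched.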

---

# 8. Grouping and aggregating data

Suppose you have data for multiple departments. You might want average salary per department.

### a) Example Data

In [16]:
# Sample data
data = {
    'Department': ['HR', 'Sales', 'HR', 'Sales', 'IT', 'IT'],
    'Salary': [50000, 60000, 52000, 61000, 55000, 58000],
}
df_dept = pd.DataFrame(data)

### b) Group by department and mean salary

In [17]:
grouped = df_dept.groupby('Department')['Salary'].mean()
grouped

Department
HR       51000.0
IT       56500.0
Sales    60500.0
Name: Salary, dtype: float64

*Try:*  
*Apply groupby on your dataset for other summaries.*
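
As one example of "other summaries", `agg` can compute several statistics per group in one call. The sketch below rebuilds the department data so that it runs on its own.

```python
import pandas as pd

# Same department data as in the example above.
df_dept = pd.DataFrame({
    'Department': ['HR', 'Sales', 'HR', 'Sales', 'IT', 'IT'],
    'Salary': [50000, 60000, 52000, 61000, 55000, 58000],
})

# One row per department, one column per statistic.
summary = df_dept.groupby('Department')['Salary'].agg(['mean', 'min', 'max', 'count'])
print(summary)
```

The result is a DataFrame indexed by department, so a single value can be read with `summary.loc['HR', 'mean']`.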

---

# 9. Pivot Tables

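Pivot tables summarize data by rearranging it: the values of one column become the rows, the values of another become the columns, and a third column is aggregated in the cells: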

In [18]:
# Example data
data = {
    'Region': ['North', 'South', 'North', 'South'],
    'Gender': ['Male', 'Male', 'Female', 'Female'],
    'Sales': [100, 150, 200, 250],
}
df_pivot = pd.DataFrame(data)

# Pivot table: mean sales by region and gender
pivot_table = pd.pivot_table(df_pivot, values='Sales', index='Region', columns='Gender', aggfunc='mean')
pivot_table

| Region / Gender | Female | Male |
|---|---|---|
| North | 200.0 | 100.0 |
| South | 250.0 | 150.0 |


*Prompt:*  
*Try creating a pivot table on your data with different groupings.*
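
It can help to see that a pivot table is closely related to `groupby`. The sketch below rebuilds the example data and computes the same table two ways: with `pd.pivot_table`, and by grouping on both columns and moving `Gender` into the columns with `unstack`.

```python
import pandas as pd

# Same example data as above.
df_pivot = pd.DataFrame({
    'Region': ['North', 'South', 'North', 'South'],
    'Gender': ['Male', 'Male', 'Female', 'Female'],
    'Sales': [100, 150, 200, 250],
})

# Mean sales with Region as rows and Gender as columns.
pivot = pd.pivot_table(df_pivot, values='Sales', index='Region',
                       columns='Gender', aggfunc='mean')

# Equivalent: group by both keys, then move Gender into the columns.
unstacked = df_pivot.groupby(['Region', 'Gender'])['Sales'].mean().unstack()

print(pivot)
print(unstacked)
```

Which form to use is mostly a matter of taste; `pivot_table` states the row and column roles explicitly.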

---

# 10. Saving and exporting data

Write your processed DataFrame to a CSV file. Passing `index=False` leaves the row index out of the file:

In [19]:
df.to_csv('processed_data.csv', index=False)
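
To check an export, you can read the file back in. The sketch below writes a small made-up frame `df_out` to a temporary directory (so no file is left behind) and reloads it.

```python
import os
import tempfile

import pandas as pd

# Illustrative data to export.
df_out = pd.DataFrame({
    'Name': ['Alice', 'Bob'],
    'Salary': [50000, 60000],
})

with tempfile.TemporaryDirectory() as tmp_dir:
    path = os.path.join(tmp_dir, 'processed_data.csv')

    # index=False keeps the row index out of the file.
    df_out.to_csv(path, index=False)

    # Read the file back to confirm what was written.
    df_back = pd.read_csv(path)

print(df_back)
```

Without `index=False`, the reloaded DataFrame would gain an extra unnamed column holding the old index.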

---

# Summary:

- pandas DataFrames are flexible, labeled tables
- You can select, filter, sort, group, aggregate, and pivot data
- These operations are the building blocks for cleaning, summarizing, and reporting data

---

# Next step:
- Try applying these functions to your own datasets
- Use pandas to prepare data for report writing

---

# Further Practice:

- Explore datasets from Kaggle or other sources
- Experiment with creating and exporting your own reports

---

# End of demo. Keep practicing!  
# pandas makes data manipulation easier and more powerful!