# Data Manipulation with NumPy and Pandas

Welcome to the second notebook of the **Python for Data Science** course! In this notebook, we will explore two powerful libraries: **NumPy** and **Pandas**. These libraries are essential for data manipulation and analysis in Python.

## 1. Introduction to NumPy

NumPy (Numerical Python) is a library used for working with arrays and performing numerical computations. It's the foundation for many other data science libraries in Python.

### 1.1. Installing NumPy

Before we begin, let's ensure that NumPy is installed. If it's not installed, you can install it using pip.

In [17]:
# Install NumPy if it's not already installed
%pip install numpy


[1m[[0m[34;49mnotice[0m[1;39;49m][0m[39;49m A new release of pip is available: [0m[31;49m24.1[0m[39;49m -> [0m[32;49m24.2[0m
[1m[[0m[34;49mnotice[0m[1;39;49m][0m[39;49m To update, run: [0m[32;49mpip install --upgrade pip[0m
Note: you may need to restart the kernel to use updated packages.


### 1.2. Importing NumPy

Let's start by importing the NumPy library and checking its version.

In [18]:
import numpy as np

# Check NumPy version
print("NumPy version:", np.__version__)

NumPy version: 2.1.0


### 1.3. Creating NumPy Arrays

NumPy arrays are the central data structure of the library. You can create arrays from Python lists using the `np.array()` function.

In [19]:
# Create a 1D NumPy array
array_1d = np.array([1, 2, 3, 4, 5])
print("1D Array:", array_1d)

# Create a 2D NumPy array
array_2d = np.array([[1, 2, 3], [4, 5, 6]])
print("2D Array:\n", array_2d)


1D Array: [1 2 3 4 5]
2D Array:
 [[1 2 3]
 [4 5 6]]
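
Besides converting lists, NumPy can build arrays directly from a description of their contents. The following is a small supplementary sketch using `np.zeros`, `np.arange`, and `np.linspace`; the variable names are illustrative and not part of the original notebook.

```python
import numpy as np

# Array of zeros with a given shape (2 rows, 3 columns)
zeros = np.zeros((2, 3))

# Evenly spaced integers: start (inclusive), stop (exclusive), step
evens = np.arange(0, 10, 2)

# Five evenly spaced values from 0 to 1, both endpoints included
grid = np.linspace(0.0, 1.0, 5)

print(zeros.shape)     # (2, 3)
print(evens.tolist())  # [0, 2, 4, 6, 8]
print(grid.tolist())   # [0.0, 0.25, 0.5, 0.75, 1.0]
```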


### 1.4. Array Operations

NumPy applies arithmetic operators element-wise: an operation between an array and a scalar is carried out on every element, with no explicit loop.


In [20]:
# Element-wise operations
array_sum = array_1d + 5
array_product = array_1d * 2

print("Array Sum:", array_sum)
print("Array Product:", array_product)


Array Sum: [ 6  7  8  9 10]
Array Product: [ 2  4  6  8 10]
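
Element-wise arithmetic also works between two arrays. As a hedged illustration (the arrays `a`, `b`, and `m` are made up for this example), the sketch below combines two arrays of the same shape and then relies on broadcasting to add a 1D array to each row of a 2D array.

```python
import numpy as np

a = np.array([1, 2, 3])
b = np.array([10, 20, 30])

# Arrays of the same shape are combined element by element
total = a + b      # [11 22 33]
product = a * b    # [10 40 90]

# Broadcasting: the 1D array b is added to every row of the 2D array m
m = np.array([[1, 2, 3], [4, 5, 6]])
shifted = m + b

print(total, product)
print(shifted)
```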


### 1.5. Array Indexing and Slicing

You can access elements in a NumPy array using indexing, similar to Python lists. NumPy also supports slicing to access subarrays.


In [21]:
# Indexing
print("First element:", array_1d[0])

# Slicing
print("First three elements:", array_1d[:3])

# Indexing in a 2D array: indices are zero-based, so [0, 1] is the first row, second column
print("Element at row 1, column 2:", array_2d[0, 1])


First element: 1
First three elements: [1 2 3]
Element at row 1, column 2: 2
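
A few more indexing patterns come up often in practice. This sketch, using an illustrative array `arr` that is not part of the original notebook, shows column slicing, negative indices, and boolean masks.

```python
import numpy as np

arr = np.array([[1, 2, 3], [4, 5, 6]])

# Slice a whole column: every row, column index 1
second_col = arr[:, 1]

# Negative indices count from the end, so -1 is the last row
last_row = arr[-1]

# A boolean mask keeps only the elements that satisfy a condition
big = arr[arr > 3]

print(second_col)  # [2 5]
print(last_row)    # [4 5 6]
print(big)         # [4 5 6]
```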


### 1.6. Array Shape and Reshaping

The shape of an array is a tuple giving the number of elements along each axis. You can change the shape with the `reshape()` method, provided the total number of elements stays the same.


In [22]:
# Array shape
print("Shape of 1D array:", array_1d.shape)
print("Shape of 2D array:", array_2d.shape)

# Reshaping an array
array_reshaped = array_1d.reshape(5, 1)
print("Reshaped array:\n", array_reshaped)


Shape of 1D array: (5,)
Shape of 2D array: (2, 3)
Reshaped array:
 [[1]
 [2]
 [3]
 [4]
 [5]]
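
When one dimension can be derived from the others, `reshape()` accepts `-1` as a placeholder. The sketch below is a supplementary example with made-up names; it also shows that an incompatible shape is rejected.

```python
import numpy as np

values = np.arange(6)             # six elements: 0..5

# -1 lets NumPy infer the column count (6 elements / 2 rows = 3 columns)
as_rows = values.reshape(2, -1)

# Flatten back to one dimension
flat = as_rows.reshape(-1)

# A shape whose size differs from the element count raises ValueError
reshape_failed = False
try:
    values.reshape(4, 2)
except ValueError:
    reshape_failed = True

print(as_rows.shape)    # (2, 3)
print(reshape_failed)   # True
```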


### 1.7. Basic Mathematical Functions

NumPy provides many built-in functions for performing mathematical operations, such as `np.mean()`, `np.sum()`, and `np.max()`.


In [23]:
# Basic mathematical operations
array_mean = np.mean(array_1d)
array_sum = np.sum(array_1d)
array_max = np.max(array_1d)

print("Mean of array:", array_mean)
print("Sum of array:", array_sum)
print("Max of array:", array_max)


Mean of array: 3.0
Sum of array: 15
Max of array: 5
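
On multi-dimensional arrays, these functions accept an `axis` argument that controls the direction of the reduction. A minimal sketch with an illustrative 2D array `m`:

```python
import numpy as np

m = np.array([[1, 2, 3], [4, 5, 6]])

# axis=0 reduces down the rows, giving one result per column
col_sums = np.sum(m, axis=0)

# axis=1 reduces across the columns, giving one result per row
row_means = np.mean(m, axis=1)

# Without axis, the whole array is reduced to a single value
overall_max = np.max(m)

print(col_sums)      # [5 7 9]
print(row_means)     # [2. 5.]
print(overall_max)   # 6
```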


---

## 2. Introduction to Pandas

Pandas is a powerful library for data manipulation and analysis. It provides two main data structures: **Series** and **DataFrame**.

### 2.1. Installing Pandas

If Pandas is not installed, you can install it using pip.


In [24]:
# Install Pandas if it's not already installed
%pip install pandas


[1m[[0m[34;49mnotice[0m[1;39;49m][0m[39;49m A new release of pip is available: [0m[31;49m24.1[0m[39;49m -> [0m[32;49m24.2[0m
[1m[[0m[34;49mnotice[0m[1;39;49m][0m[39;49m To update, run: [0m[32;49mpip install --upgrade pip[0m
Note: you may need to restart the kernel to use updated packages.


### 2.2. Importing Pandas

Let's import the Pandas library and check its version.


In [25]:
import pandas as pd

# Check Pandas version
print("Pandas version:", pd.__version__)


Pandas version: 2.2.2


### 2.3. Pandas Series

A Pandas Series is a one-dimensional, array-like object that holds a sequence of values together with an associated array of labels, called its index.

Let's create a simple Series.


In [26]:
# Creating a Pandas Series
data = [10, 20, 30, 40]
index = ['a', 'b', 'c', 'd']
series = pd.Series(data, index=index)
print("Pandas Series:\n", series)


Pandas Series:
 a    10
b    20
c    30
d    40
dtype: int64
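
The index is what distinguishes a Series from a plain array: values can be looked up by label as well as by position. The following sketch uses an illustrative Series `s` with the same values as the example.

```python
import pandas as pd

s = pd.Series([10, 20, 30, 40], index=["a", "b", "c", "d"])

# Look up a value by its label
by_label = s["b"]

# Look up a value by integer position with .iloc
by_position = s.iloc[0]

# Label-based slices include both endpoints
label_slice = s["b":"c"]

# Arithmetic is element-wise and keeps the index
doubled = s * 2

print(by_label, by_position)   # 20 10
print(label_slice.tolist())    # [20, 30]
```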


### 2.4. Pandas DataFrame

A DataFrame is a two-dimensional, size-mutable, and potentially heterogeneous tabular data structure with labeled axes (rows and columns).

Let's create a DataFrame.


In [27]:
# Creating a Pandas DataFrame
data = {
    "Name": ["Alice", "Bob", "Charlie", "David"],
    "Age": [24, 27, 22, 32],
    "City": ["New York", "Los Angeles", "Chicago", "Houston"]
}
df = pd.DataFrame(data)
print("Pandas DataFrame:\n", df)


Pandas DataFrame:
       Name  Age         City
0    Alice   24     New York
1      Bob   27  Los Angeles
2  Charlie   22      Chicago
3    David   32      Houston


### 2.5. Accessing Data in a DataFrame

You can access individual columns, rows, and cells in a DataFrame by label (with `[]` and `.loc`) or by integer position (with `.iloc`).


In [28]:
# Accessing a column
print("Name column:\n", df["Name"])

# Accessing the first row by integer position (iloc is zero-based)
print("Row 1:\n", df.iloc[0])

# Accessing a cell by row and column
print("Cell (0, 1):", df.iloc[0, 1])


Name column:
 0      Alice
1        Bob
2    Charlie
3      David
Name: Name, dtype: object
Row 1:
 Name       Alice
Age           24
City    New York
Name: 0, dtype: object
Cell (0, 1): 24
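
The difference between label-based and position-based access is easiest to see when the index is not the default 0, 1, 2. The sketch below assumes a hypothetical DataFrame `people` with string row labels.

```python
import pandas as pd

people = pd.DataFrame(
    {"Name": ["Alice", "Bob", "Charlie"], "Age": [24, 27, 22]},
    index=["p1", "p2", "p3"],  # hypothetical row labels
)

# .loc selects by row label and column name
age_by_label = people.loc["p2", "Age"]

# .iloc selects by zero-based row and column position
age_by_position = people.iloc[1, 1]

# Lists of labels select several rows and columns at once
subset = people.loc[["p1", "p3"], ["Name"]]

print(age_by_label, age_by_position)   # 27 27
print(subset)
```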


### 2.6. DataFrame Operations

Pandas allows you to perform various operations on DataFrames, such as filtering, adding new columns, and deleting columns.


In [29]:
# Filtering rows
filtered_df = df[df["Age"] > 25]
print("Filtered DataFrame:\n", filtered_df)

# Adding a new column
df["Country"] = ["USA", "USA", "USA", "USA"]
print("DataFrame with new column:\n", df)

# Deleting a column
df = df.drop("Country", axis=1)
print("DataFrame after deleting a column:\n", df)

Filtered DataFrame:
     Name  Age         City
1    Bob   27  Los Angeles
3  David   32      Houston
DataFrame with new column:
       Name  Age         City Country
0    Alice   24     New York     USA
1      Bob   27  Los Angeles     USA
2  Charlie   22      Chicago     USA
3    David   32      Houston     USA
DataFrame after deleting a column:
       Name  Age         City
0    Alice   24     New York
1      Bob   27  Los Angeles
2  Charlie   22      Chicago
3    David   32      Houston
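
Filters can combine several conditions, and new columns can be derived from existing ones. This is a minimal sketch on an illustrative copy of the data (`df_demo` and the `Over25` column are made up for this example).

```python
import pandas as pd

df_demo = pd.DataFrame({
    "Name": ["Alice", "Bob", "Charlie", "David"],
    "Age": [24, 27, 22, 32],
    "City": ["New York", "Los Angeles", "Chicago", "Houston"],
})

# Combine conditions with & (and) or | (or); each condition needs parentheses
mask = (df_demo["Age"] > 22) & (df_demo["City"] != "Houston")
selected = df_demo[mask]

# assign() returns a new DataFrame and leaves the original unchanged
with_flag = df_demo.assign(Over25=df_demo["Age"] > 25)

print(selected["Name"].tolist())     # ['Alice', 'Bob']
print(with_flag["Over25"].tolist())  # [False, True, False, True]
```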


### 2.7. Handling Missing Data

Pandas provides functions to detect, fill, and drop missing data.

In [30]:
# Handling missing data
# Setting a value to None stores NaN, which converts the integer Age column to float
df_with_nan = df.copy()
df_with_nan.loc[1, "Age"] = None

# Detecting missing values
print("Missing values:\n", df_with_nan.isna())

# Filling missing values
df_filled = df_with_nan.fillna(0)
print("DataFrame with filled missing values:\n", df_filled)

# Dropping rows with missing values
df_dropped = df_with_nan.dropna()
print("DataFrame after dropping rows with missing values:\n", df_dropped)


Missing values:
     Name    Age   City
0  False  False  False
1  False   True  False
2  False  False  False
3  False  False  False
DataFrame with filled missing values:
       Name   Age         City
0    Alice  24.0     New York
1      Bob   0.0  Los Angeles
2  Charlie  22.0      Chicago
3    David  32.0      Houston
DataFrame after dropping rows with missing values:
       Name   Age      City
0    Alice  24.0  New York
2  Charlie  22.0   Chicago
3    David  32.0   Houston
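
Filling with 0 can distort statistics such as the mean, so a common alternative is to fill with a value derived from the column itself. The sketch below is one possible approach rather than a recommendation; the DataFrame `ages` is illustrative.

```python
import numpy as np
import pandas as pd

ages = pd.DataFrame({
    "Name": ["Alice", "Bob", "Charlie", "David"],
    "Age": [24, np.nan, 22, 32],
})

# Count missing values in a column
n_missing = ages["Age"].isna().sum()

# mean() skips NaN by default: (24 + 22 + 32) / 3 = 26.0
mean_age = ages["Age"].mean()

# A dict passed to fillna() sets a fill value per column
ages_filled = ages.fillna({"Age": mean_age})

print(n_missing)                    # 1
print(ages_filled["Age"].tolist())  # [24.0, 26.0, 22.0, 32.0]
```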


### 2.8. Loading and Saving Data

Pandas can load data from various sources, such as CSV files, Excel files, and SQL databases, and save DataFrames back to them.


In [31]:
# Example: Loading data from a CSV file
# Replace 'your_file.csv' with an actual file path
#df_from_csv = pd.read_csv("your_file.csv")

# Example: Saving a DataFrame to a CSV file
#df.to_csv("output.csv", index=False)
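
To try a CSV round trip without touching the file system, the text can be written to an in-memory buffer. This sketch assumes a small illustrative DataFrame `table`; with a real file you would pass a path instead of the `io.StringIO` object.

```python
import io

import pandas as pd

table = pd.DataFrame({"Name": ["Alice", "Bob"], "Age": [24, 27]})

# Write the DataFrame as CSV text into an in-memory buffer
buffer = io.StringIO()
table.to_csv(buffer, index=False)  # index=False omits the row labels
csv_text = buffer.getvalue()

# Read the CSV text back into a new DataFrame
restored = pd.read_csv(io.StringIO(csv_text))

print(csv_text)
print(restored.equals(table))
```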

---

## 3. Data Manipulation Techniques

Let's explore some common data manipulation techniques in Pandas, such as grouping, merging, and pivoting.

### 3.1. Grouping Data

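The `groupby()` method splits a DataFrame into groups based on one or more columns and applies an aggregate function to each group. Aggregates such as `mean()` are only defined for numeric data, so select the numeric column(s) first; calling `df.groupby("City").mean()` on the whole DataFrame raises a `TypeError` because the `Name` column holds strings.


In [32]:
# Group by city and average the numeric Age column only
grouped = df.groupby("City")["Age"].mean()
print("Mean age by city:\n", grouped)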
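
Grouping is more informative when groups contain several rows. The sketch below uses a hypothetical `sales` table (not part of the original notebook) and computes several aggregates per group with `agg()`.

```python
import pandas as pd

sales = pd.DataFrame({
    "City": ["Chicago", "Chicago", "Houston", "Houston", "Houston"],
    "Amount": [10, 30, 5, 15, 40],
})

# Select the numeric column, then compute several aggregates per city
summary = sales.groupby("City")["Amount"].agg(["mean", "max", "count"])

print(summary)
```

Each row of `summary` corresponds to one city, and each column to one of the requested aggregates.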

### 3.2. Merging DataFrames

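You can combine two DataFrames with the `pd.merge()` function, which works like a SQL join. The `on` argument names the key column, and `how` selects the join type; an inner join keeps only the keys present in both DataFrames.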


In [None]:
# Example DataFrames for merging
df1 = pd.DataFrame({
    "ID": [1, 2, 3],
    "Name": ["Alice", "Bob", "Charlie"]
})
df2 = pd.DataFrame({
    "ID": [1, 2, 4],
    "City": ["New York", "Los Angeles", "Chicago"]
})

# Merging DataFrames on a common column
merged_df = pd.merge(df1, df2, on="ID", how="inner")
print("Merged DataFrame:\n", merged_df)
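
Other join types keep rows that have no match on one side. As a hedged illustration, the sketch below rebuilds the two example tables under the names `left` and `right` and compares inner, left, and outer joins; `indicator=True` adds a `_merge` column showing where each row came from.

```python
import pandas as pd

left = pd.DataFrame({"ID": [1, 2, 3], "Name": ["Alice", "Bob", "Charlie"]})
right = pd.DataFrame({"ID": [1, 2, 4], "City": ["New York", "Los Angeles", "Chicago"]})

# Inner join: only IDs present in both tables (1 and 2)
inner = pd.merge(left, right, on="ID", how="inner")

# Left join: every row of the left table; City is NaN where there is no match
left_join = pd.merge(left, right, on="ID", how="left")

# Outer join: all IDs from both tables, with the origin of each row recorded
outer = pd.merge(left, right, on="ID", how="outer", indicator=True)

print(inner)
print(left_join)
print(outer)
```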


### 3.3. Pivoting DataFrames

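Pivoting reshapes data from long to wide format. The `pivot()` method builds a new DataFrame in which the unique values of one column become the row index, the unique values of another become the column headers, and a third column fills the cells. Combinations that do not occur in the data are filled with `NaN`, and each index/column pair must be unique, otherwise `pivot()` raises a `ValueError`.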


In [None]:
# Example of pivoting data
pivot_df = df.pivot(index="Name", columns="City", values="Age")
print("Pivoted DataFrame:\n", pivot_df)
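
Pivoting is easier to follow on data where each row label appears more than once. The sketch below uses a hypothetical `long_df` of scores per person and year, and shows `pivot_table()` as one way to handle duplicate index/column pairs by aggregating them.

```python
import pandas as pd

long_df = pd.DataFrame({
    "Name": ["Alice", "Alice", "Bob", "Bob"],
    "Year": [2023, 2024, 2023, 2024],
    "Score": [80, 90, 70, 75],
})

# One row per name, one column per year
wide = long_df.pivot(index="Name", columns="Year", values="Score")

# Repeat the first row so that (Alice, 2023) appears twice
dup = pd.concat([long_df, long_df.iloc[[0]]], ignore_index=True)

# pivot() refuses duplicate index/column pairs
pivot_failed = False
try:
    dup.pivot(index="Name", columns="Year", values="Score")
except ValueError:
    pivot_failed = True

# pivot_table() aggregates duplicates instead (here with the mean)
averaged = dup.pivot_table(index="Name", columns="Year", values="Score", aggfunc="mean")

print(wide)
print(pivot_failed)   # True
print(averaged)
```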

---

## 4. Summary and Next Steps

In this notebook, we covered the basics of NumPy and Pandas, focusing on array operations, DataFrames, and common data manipulation techniques.

### Next Steps:

- Experiment with the concepts you've learned here.
- Try loading real-world datasets and performing basic analyses with NumPy and Pandas.

In the next notebook, we'll dive deeper into data visualization techniques using Matplotlib and Seaborn.