## Pandas
### Introduction to pandas objects
A Series is a one-dimensional array of indexed data.
- values = 10, 20, 30, 40
- index = 0, 1, 2, 3
- pd.Series([10, 20, 30, 40])

In [2]:
import pandas as pd

In [3]:
# creating a Series (the variable is not named `list`, which would shadow the built-in)
values = [10, 20, 30, 40, 50]
values

[10, 20, 30, 40, 50]

In [4]:
pd.Series(values)

0    10
1    20
2    30
3    40
4    50
dtype: int64
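
The index does not have to be the default 0, 1, 2, ... A minimal sketch, assuming hypothetical string labels `"a"` to `"d"` that are not part of the original notes, shows the same values looked up by label and by position:

```python
import pandas as pd

# Hypothetical labels chosen only for illustration
scores = pd.Series([10, 20, 30, 40], index=["a", "b", "c", "d"])

print(scores["b"])         # look up by label
print(scores.iloc[1])      # look up by integer position
print(list(scores.index))  # the labels themselves
```

Both lookups return the same element here because `"b"` is the label at position 1.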

## DataFrame
A collection of Series that share the same index.
- It is 2D: it has rows and columns.
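
To make the "collection of Series" idea concrete, here is a small sketch that builds a DataFrame from two Series sharing one index. The variable names (`score`, `age`, `df_from_series`) are illustrative assumptions:

```python
import pandas as pd

names = ["John", "Reagan", "Nathan"]
score = pd.Series([85, 90, 78], index=names)
age = pd.Series([20, 21, 19], index=names)

# Each Series becomes one column; rows are aligned on the shared index
df_from_series = pd.DataFrame({"Score": score, "Age": age})
print(df_from_series)
print(df_from_series.shape)  # (rows, columns)
```

Selecting a single column, such as `df_from_series["Score"]`, gives back a Series.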

In [5]:
#creating a dataframe
data = {
    "Name": ["John", "Reagan", "Nathan", "Abdullahi", "Bethany"],
    "Score":[85,90,78,88,92],
    "Age": [20,21,19,22,20],
    "Location": ["Juja", "Utawala", "Uthiru", "Kilimani", "Membley"]
}  
data

{'Name': ['John', 'Reagan', 'Nathan', 'Abdullahi', 'Bethany'],
 'Score': [85, 90, 78, 88, 92],
 'Age': [20, 21, 19, 22, 20],
 'Location': ['Juja', 'Utawala', 'Uthiru', 'Kilimani', 'Membley']}

In [7]:
df = pd.DataFrame(data)
df

Unnamed: 0,Name,Score,Age,Location
0,John,85,20,Juja
1,Reagan,90,21,Utawala
2,Nathan,78,19,Uthiru
3,Abdullahi,88,22,Kilimani
4,Bethany,92,20,Membley


### Viewing and accessing data

In [8]:
df.head(2)

Unnamed: 0,Name,Score,Age,Location
0,John,85,20,Juja
1,Reagan,90,21,Utawala


In [9]:
df.tail(2)

Unnamed: 0,Name,Score,Age,Location
3,Abdullahi,88,22,Kilimani
4,Bethany,92,20,Membley


In [None]:
# summary
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 5 entries, 0 to 4
Data columns (total 4 columns):
 #   Column    Non-Null Count  Dtype 
---  ------    --------------  ----- 
 0   Name      5 non-null      object
 1   Score     5 non-null      int64 
 2   Age       5 non-null      int64 
 3   Location  5 non-null      object
dtypes: int64(2), object(2)
memory usage: 292.0+ bytes


In [None]:
#statistics
df.describe()

Unnamed: 0,Score,Age
count,5.0,5.0
mean,86.6,20.4
std,5.458938,1.140175
min,78.0,19.0
25%,85.0,20.0
50%,88.0,20.0
75%,90.0,21.0
max,92.0,22.0


In [15]:
# Accessing columns
df.Name

0         John
1       Reagan
2       Nathan
3    Abdullahi
4      Bethany
Name: Name, dtype: object

### Data indexing and selection
`loc[]` (label-based indexing)
- selects rows and columns by label name
- can handle non-integer labels such as strings
- when slicing, it includes both the start and the end label

`iloc[]` (integer-location-based indexing)
- uses integer positions to select data
- ignores the actual labels of rows and columns
- when slicing, it excludes the end position, like standard Python slicing
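
The slicing difference is easiest to see side by side. This is a minimal sketch on a small frame with names as the index (`df_demo` is an assumed name used only here): a label slice with `loc` keeps the end label, while a position slice with `iloc` stops before the end position.

```python
import pandas as pd

df_demo = pd.DataFrame(
    {"Score": [85, 90, 78, 88, 92]},
    index=["John", "Reagan", "Nathan", "Abdullahi", "Bethany"],
)

by_label = df_demo.loc["John":"Nathan"]  # end label included -> 3 rows
by_position = df_demo.iloc[0:2]          # end position excluded -> 2 rows

print(len(by_label), len(by_position))   # 3 2
```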


In [17]:
df.loc[0]

Name        John
Score         85
Age           20
Location    Juja
Name: 0, dtype: object

In [26]:
df.loc[0:4]

Unnamed: 0,Name,Score,Age,Location
0,John,85,20,Juja
1,Reagan,90,21,Utawala
2,Nathan,78,19,Uthiru
3,Abdullahi,88,22,Kilimani
4,Bethany,92,20,Membley


In [28]:
df.loc[0,"Name"]

'John'

In [24]:
df.iloc[0,0]

'John'

In [36]:
df.loc[0:2, "Score"]

0    85
1    90
2    78
Name: Score, dtype: int64

In [None]:
# understand loc and iloc
# understand labels


In [38]:
# Accessing score using labels

In [50]:
#creating a dataframe
data = pd.DataFrame({
    "Score":[85,90,78,88,92],
    "Age": [20,21,19,22,20],
    "Location": ["Juja", "Utawala", "Uthiru", "Kilimani", "Membley"]
},
index = ["John", "Reagan", "Nathan", "Abdullahi", "Bethany"])
data

Unnamed: 0,Score,Age,Location
John,85,20,Juja
Reagan,90,21,Utawala
Nathan,78,19,Uthiru
Abdullahi,88,22,Kilimani
Bethany,92,20,Membley


In [56]:
# Accessing a single value (row label, column label)
data.loc["John", "Score"]

85

In [62]:
data.iloc[0,1]

20

In [61]:
data.loc[["John", "Nathan"], ["Score", "Age"]]

Unnamed: 0,Score,Age
John,85,20
Nathan,78,19


## **Assignment Combining Data**

**1. Why Combine Data in NumPy?**

In real-world data analysis, data may come in parts — different arrays representing:
- different features (columns),
- different observations (rows),
- or even from different files.

NumPy provides several efficient tools to combine or stack these arrays into a single unified structure.

**2. Methods to Combine Arrays in NumPy**

- np.concatenate()
- np.stack()
- np.hstack() and np.vstack()
- np.column_stack() and np.row_stack()

### i. np.concatenate() – Basic Concatenation
This joins arrays along an existing axis (axis 0 = rows, axis 1 = columns).

In [6]:
import numpy as np

b = np.array([[5, 6]])
print(b.T)   # transpose: shape (1, 2) becomes (2, 1)
b.shape      # b itself is unchanged

[[5]
 [6]]


(1, 2)

In [4]:
# Concatenate by columns (axis=1): b.T has shape (2, 1), so it is added as a new column

import numpy as np

a = np.array([[1, 2], [3, 4]])
b = np.array([[5, 6]])

result = np.concatenate((a, b.T), axis=1)
print(result)

[[1 2 5]
 [3 4 6]]


In [11]:
a

array([[1, 2],
       [3, 4]])

In [12]:
a.shape

(2, 2)

In [10]:
b.shape

(1, 2)

Arrays must have the same number of columns for axis=0; here `a` and `b` both have 2 columns.
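
Since `a` has shape (2, 2) and `b` has shape (1, 2), they can also be joined by rows without transposing. A short sketch of the axis=0 case (the name `rows` is an assumption for illustration):

```python
import numpy as np

a = np.array([[1, 2], [3, 4]])  # shape (2, 2)
b = np.array([[5, 6]])          # shape (1, 2)

# Same number of columns, so b is appended as a third row
rows = np.concatenate((a, b), axis=0)
print(rows)
print(rows.shape)  # (3, 2)
```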

In [13]:
# Concatenate by columns (axis=1)

a = np.array([[1, 2], [3, 4]])
b = np.array([[5, 6], [7, 8]])

result = np.concatenate((a, b), axis=1)
print(result)

[[1 2 5 6]
 [3 4 7 8]]


Arrays must have the same number of rows for axis=1.

### ii. np.stack() – Combine Along a New Axis
Unlike concatenate(), this adds a new dimension.

In [14]:
a = np.array([1, 2])
b = np.array([3, 4])

stacked = np.stack((a, b), axis=0)  # shape becomes (2, 2)
print(stacked)

[[1 2]
 [3 4]]
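
To see what "a new dimension" means in practice, the sketch below compares the result shapes of `np.concatenate()` and `np.stack()` on the same 1D inputs, and also tries `axis=1`. The result variable names are illustrative assumptions:

```python
import numpy as np

a = np.array([1, 2])
b = np.array([3, 4])

joined = np.concatenate((a, b))          # stays 1D
stacked_rows = np.stack((a, b), axis=0)  # a and b become rows
stacked_cols = np.stack((a, b), axis=1)  # a and b become columns

print(joined.shape, stacked_rows.shape, stacked_cols.shape)
print(stacked_cols)
```

`np.stack()` requires all inputs to have the same shape, because each one becomes a slice along the new axis.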


### iii. np.hstack() – Horizontal Stack (Column-wise)
Joins arrays horizontally (columns side by side).

In [66]:
a = np.array([1, 2])
b = np.array([3, 4])

result = np.hstack((a, b))
print(result)

[1 2 3 4]


In [68]:
# For 2D
a = np.array([[1], [2]])
b = np.array([[3], [4]])

result = np.hstack((a, b))
print(result)

[[1 3]
 [2 4]]


### iv. np.vstack() – Vertical Stack (Row-wise)
Joins arrays vertically (rows stacked on top of each other).

In [69]:
a = np.array([1, 2])
b = np.array([3, 4])

result = np.vstack((a, b))
print(result)

[[1 2]
 [3 4]]


### v. np.column_stack() – Stack 1D Arrays as Columns

In [70]:
a = np.array([1, 2, 3])
b = np.array([4, 5, 6])

result = np.column_stack((a, b))
print(result)

[[1 4]
 [2 5]
 [3 6]]


Useful for assembling several 1D arrays as the columns of a single 2D array.
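
The difference from `np.hstack()` only shows up with 1D inputs. A brief sketch (variable names `flat` and `columns` are assumptions):

```python
import numpy as np

a = np.array([1, 2, 3])
b = np.array([4, 5, 6])

flat = np.hstack((a, b))           # 1D inputs are joined end to end
columns = np.column_stack((a, b))  # 1D inputs become columns of a 2D array

print(flat.shape)     # (6,)
print(columns.shape)  # (3, 2)
```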

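### vi. np.row_stack() – Stack 1D Arrays as Rows
`np.row_stack()` is an alias of `np.vstack()` and is deprecated as of NumPy 2.0; prefer `np.vstack()` in new code.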

In [71]:
a = np.array([1, 2, 3])
b = np.array([4, 5, 6])

result = np.row_stack((a, b))
print(result)

[[1 2 3]
 [4 5 6]]


## Summary
| Method               | What It Does                                   | Good For                     |
|----------------------|-----------------------------------------------|------------------------------|
| `np.concatenate()`   | Joins arrays along an existing axis           | General merging              |
| `np.stack()`         | Joins along a new axis                        | Creating new dimensions      |
| `np.hstack()`        | Stacks arrays horizontally (axis=1 for 2D)    | Combining columns            |
| `np.vstack()`        | Stacks arrays vertically (axis=0)             | Adding new rows              |
| `np.column_stack()`  | Turns 1D arrays into 2D columns               | Building a dataset           |
| `np.row_stack()`     | Turns 1D arrays into 2D rows                  | Appending examples           |

## Assignment - Pandas
### Why Combine Data in Pandas?

Combining data is a core operation in pandas. It is used when:
- Adding more rows (observations)
- Adding more columns (features)
- Merging data from different sources (e.g., different files, tables, or APIs)

### Methods of Combining Data

| Method               | What It Does                                     |
|----------------------|--------------------------------------------------|
| `concat()`           | Stack DataFrames vertically or horizontally      |
| `append()`           | Add rows to a DataFrame (removed in pandas 2.0)  |
| `merge()`            | SQL-style joins based on key columns             |
| `join()`             | Combine columns by index or key                  |
| `combine_first()`    | Fill in missing values from another DataFrame    |
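
`combine_first()` is listed above but not demonstrated in these notes. The following is a minimal sketch with made-up data (`primary` and `backup` are hypothetical frames, not from the original): missing values in the calling DataFrame are filled from the other one, and existing values are kept.

```python
import numpy as np
import pandas as pd

# Hypothetical data: Reagan's score is missing in the primary source
primary = pd.DataFrame({"Score": [85, np.nan, 78]},
                       index=["John", "Reagan", "Nathan"])
backup = pd.DataFrame({"Score": [80, 90, 70]},
                      index=["John", "Reagan", "Nathan"])

# Values from `primary` win; `backup` only fills the gaps
filled = primary.combine_first(backup)
print(filled)
```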

In [2]:
import pandas as pd

Students = {
    "Name": ["John", "Reagan", "Nathan", "Abdullahi", "Bethany"],
    "Score": [85, 90, 78, 88, 92],
    "Age": [20, 21, 19, 22, 20],
    "Location": ["Juja", "Utawala", "Uthiru", "Kilimani", "Membley"]
}

data = pd.DataFrame(Students)
data

Unnamed: 0,Name,Score,Age,Location
0,John,85,20,Juja
1,Reagan,90,21,Utawala
2,Nathan,78,19,Uthiru
3,Abdullahi,88,22,Kilimani
4,Bethany,92,20,Membley


### 1. `pd.concat()` – Concatenation (Stacking Rows or Columns)
**What is pd.concat()?**

`pd.concat()` is a Pandas function that concatenates (combines) DataFrames or Series along a particular axis (rows or columns). It's one of Pandas' most versatile data combination tools, allowing you to:
- Stack DataFrames vertically (row-wise)
- Combine DataFrames horizontally (column-wise)
- Handle different indexes from the source DataFrames
- Choose how labels on the other axis are combined (`join="outer"` keeps all of them, `join="inner"` keeps only the shared ones)

Key Characteristics:
- Axis Flexibility: Can combine along rows (axis=0) or columns (axis=1)
- Index Preservation: By default keeps original indexes from source DataFrames
- Set Operations: Can perform unions (default) or intersections of indexes
- Multi-indexing: Can create hierarchical indexes when combining
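
The "Set Operations" and "Multi-indexing" points can be sketched with two tiny frames whose columns only partly overlap. The frames and the key labels `"old"` and `"new"` are assumptions made for illustration:

```python
import pandas as pd

first = pd.DataFrame({"Name": ["John"], "Score": [85]})
second = pd.DataFrame({"Name": ["Moses"], "Score": [75], "Age": [23]})

union = pd.concat([first, second])                        # join="outer" is the default
intersection = pd.concat([first, second], join="inner")   # only shared columns
labelled = pd.concat([first, second], keys=["old", "new"])  # hierarchical index

print(list(union.columns))
print(list(intersection.columns))
print(labelled.index.tolist())
```

With the default outer join, `first` has no `Age` column, so its row gets NaN there.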

In [27]:
# Combine two DataFrames row-wise (new entries)

data1 = pd.DataFrame({
    "Name": ["Moses", "Ruth"],
    "Score": [75, 82],
    "Age": [23, 21],
    "Location": ["Ngara", "Thika"]
})

combined = pd.concat([data, data1])
combined

Unnamed: 0,Name,Score,Age,Location
0,John,85,20,Juja
1,Reagan,90,21,Utawala
2,Nathan,78,19,Uthiru
3,Abdullahi,88,22,Kilimani
4,Bethany,92,20,Membley
0,Moses,75,23,Ngara
1,Ruth,82,21,Thika


**Why Indexes Repeat (0 and 1)**
- `pd.concat()` preserves the original indexes from each DataFrame by default
- Unless you pass `ignore_index=True`, the rows from `data1` keep their own labels 0 and 1
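
Repeated labels matter because label-based lookups then match more than one row. A small sketch, using two assumed frames `top` and `bottom`:

```python
import pandas as pd

top = pd.DataFrame({"Name": ["John", "Reagan"]})
bottom = pd.DataFrame({"Name": ["Moses", "Ruth"]})

stacked = pd.concat([top, bottom])

print(stacked.index.tolist())           # labels repeat: [0, 1, 0, 1]
print(stacked.loc[0, "Name"].tolist())  # label 0 matches two rows
print(stacked.index.is_unique)          # False
```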

In [28]:
combined = pd.concat([data, data1], ignore_index=True)
combined

Unnamed: 0,Name,Score,Age,Location
0,John,85,20,Juja
1,Reagan,90,21,Utawala
2,Nathan,78,19,Uthiru
3,Abdullahi,88,22,Kilimani
4,Bethany,92,20,Membley
5,Moses,75,23,Ngara
6,Ruth,82,21,Thika


In [7]:
# Combine column-wise

extra = pd.DataFrame({
    "Gender": ["M", "M", "M", "M", "F"]
})

# Add columns to existing df
extended = pd.concat([data, extra], axis=1)
extended

Unnamed: 0,Name,Score,Age,Location,Gender
0,John,85,20,Juja,M
1,Reagan,90,21,Utawala,M
2,Nathan,78,19,Uthiru,M
3,Abdullahi,88,22,Kilimani,M
4,Bethany,92,20,Membley,F


Use this when you're adding more features/columns to existing rows.
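
Note that column-wise `pd.concat()` aligns rows by index label, not by position. The sketch below uses assumed frames `left` and `right` whose indexes only partly overlap, so the unmatched row gets NaN:

```python
import pandas as pd

left = pd.DataFrame({"Name": ["John", "Reagan", "Nathan"]})  # index 0, 1, 2
right = pd.DataFrame({"Gender": ["M", "M"]}, index=[1, 2])   # index 1, 2 only

# Rows are matched on index labels; label 0 has no Gender value
side_by_side = pd.concat([left, right], axis=1)
print(side_by_side)
```

If the two frames are meant to line up by position, resetting their indexes first (for example with `reset_index(drop=True)`) avoids this surprise.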

### 2. `pd.merge()` – Database-style Merge (Joins)

`pd.merge()` is primarily a function for database-style joining of DataFrames. It combines datasets based on the values of common columns (keys) and offers the same kinds of joins as SQL JOIN operations.

Key Features
- SQL-like Joins: Supports INNER, LEFT, RIGHT, OUTER, and CROSS joins
- Flexible Key Specification: Can join on single or multiple columns
- Index/Column Joins: Can join on indexes or columns
- Suffix Handling: Automatically handles duplicate column names
- Indicator Column: Can add a column showing merge source
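
As a quick illustration of the indicator column, here is a sketch with two small assumed frames (`left`, `right`) and an outer join. Passing `indicator=True` adds a `_merge` column that records where each row came from:

```python
import pandas as pd

left = pd.DataFrame({"Name": ["John", "Reagan"], "Score": [85, 90]})
right = pd.DataFrame({"Name": ["John", "Ryan"], "Gender": ["M", "M"]})

# _merge takes the values "both", "left_only", or "right_only"
audit = pd.merge(left, right, on="Name", how="outer", indicator=True)
print(audit[["Name", "_merge"]])
```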

In [8]:
data

Unnamed: 0,Name,Score,Age,Location
0,John,85,20,Juja
1,Reagan,90,21,Utawala
2,Nathan,78,19,Uthiru
3,Abdullahi,88,22,Kilimani
4,Bethany,92,20,Membley


In [6]:
join

Unnamed: 0,Name,Gender
0,John,M
1,Nathan,M
2,Bethany,F
3,Ryan,M


In [5]:
# Merge based on "Name"
join = pd.DataFrame({
    "Name": ["John", "Nathan", "Bethany", "Ryan"],
    "Gender": ["M", "M", "F", "M"]
})

merged = pd.merge(data, join, on="Name", how="left")
merged

Unnamed: 0,Name,Score,Age,Location,Gender
0,John,85,20,Juja,M
1,Reagan,90,21,Utawala,
2,Nathan,78,19,Uthiru,M
3,Abdullahi,88,22,Kilimani,
4,Bethany,92,20,Membley,F


1. Successful Matches (Got Gender values):
 - John (row 0) → Found in join → Gender = "M"
 - Nathan (row 2) → Found in join → Gender = "M"
 - Bethany (row 4) → Found in join → Gender = "F"

2. Unmatched Records (Got NaN):
 - Reagan → Not in join → Gender = NaN
 - Abdullahi → Not in join → Gender = NaN

3. Left Join Behavior:
 - All original rows from `data` are preserved
 - Only matching rows from `join` contribute gender values
 - Non-matching names get `NaN` in the Gender column

In [36]:
pd.merge(data, join, on="Name", how="inner")

Unnamed: 0,Name,Score,Age,Location,Gender
0,John,85,20,Juja,M
1,Nathan,78,19,Uthiru,M
2,Bethany,92,20,Membley,F


Key Characteristics of Inner Join:
  - Exclusive Matching: Only returns rows with matching keys in both tables
  - No Missing Values from the Join: Does not introduce NaN for unmatched rows (unlike left/right/outer joins); NaN values already present in the inputs are kept
  - Smaller Output: With unique keys, produces no more rows than either original table; duplicate keys can multiply rows

In [37]:
pd.merge(data, join, on="Name", how="right")

Unnamed: 0,Name,Score,Age,Location,Gender
0,John,85.0,20.0,Juja,M
1,Nathan,78.0,19.0,Uthiru,M
2,Bethany,92.0,20.0,Membley,F
3,Ryan,,,,M


Explanation:
- Keeps all rows from the right DataFrame (join)
- Adds matching data from the left DataFrame (data)
- For "Ryan" (exists in join but not in data), all left DataFrame columns become NaN
- Names existing only in left DataFrame (Reagan, Abdullahi) are dropped
- Shows how right joins can reveal "orphaned" records in the right table

In [9]:
data

Unnamed: 0,Name,Score,Age,Location
0,John,85,20,Juja
1,Reagan,90,21,Utawala
2,Nathan,78,19,Uthiru
3,Abdullahi,88,22,Kilimani
4,Bethany,92,20,Membley


In [42]:
suffix = pd.DataFrame({
    "Name": ["John", "Nathan", "Bethany", "Ryan"],
    "Gender": ["M", "M", "F", "M"],
    "Score": [85, 90, 78, 92]
})

pd.merge(data, suffix, on="Name", suffixes=('_original', '_gender'))

Unnamed: 0,Name,Score_original,Age,Location,Gender,Score_gender
0,John,85,20,Juja,M,85
1,Nathan,78,19,Uthiru,M,90
2,Bethany,92,20,Membley,F,78


1. Value Matching:
- For "John":
    - Score_original: 85 (from data)
    - Score_gender: 85 (from suffix)

- For "Nathan":
    - Score_original: 78 (from data)
    - Score_gender: 90 (from suffix)

- For "Bethany":
    - Score_original: 92 (from data)
    - Score_gender: 78 (from suffix)

This type of merge is particularly valuable when:
 - You need to compare values of the same metric from different sources
 - You want to track changes in values over time
 - You're combining datasets that measure the same attributes differently
 - You need to identify discrepancies between data sources

Use this when you have extra data linked by a common key.

Types of Join:
- how="inner" → only matching rows
- how="left" → all rows from left, matching from right
- how="right" → all rows from right, matching from left
- how="outer" → all rows from both (fills in NaN where no match)
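
One way to compare the four join types is to count the rows each one returns. This sketch rebuilds trimmed-down versions of the `data` and `join` frames from this section (only the `Name`, `Score`, and `Gender` columns) so it runs on its own; `row_counts` is an assumed name:

```python
import pandas as pd

data = pd.DataFrame({"Name": ["John", "Reagan", "Nathan", "Abdullahi", "Bethany"],
                     "Score": [85, 90, 78, 88, 92]})
join = pd.DataFrame({"Name": ["John", "Nathan", "Bethany", "Ryan"],
                     "Gender": ["M", "M", "F", "M"]})

# Number of rows produced by each join type
row_counts = {how: len(pd.merge(data, join, on="Name", how=how))
              for how in ["inner", "left", "right", "outer"]}
print(row_counts)  # {'inner': 3, 'left': 5, 'right': 4, 'outer': 6}
```

The outer join has 6 rows: the 5 names in `data` plus Ryan, who appears only in `join`.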

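### 3. `df.append()` – (Removed) Add Rows
`DataFrame.append()` was deprecated in pandas 1.4 and removed in pandas 2.0, so use `pd.concat()` instead.

In [10]:
new_student = pd.DataFrame([{
    "Name": "Naomi",
    "Score": 79,
    "Age": 20,
    "Location": "Ngong"
}])

# data.append(new_student, ignore_index=True) raises AttributeError in pandas 2.0+;
# pd.concat() is the replacement
Students = pd.concat([data, new_student], ignore_index=True)
Students

Unnamed: 0,Name,Score,Age,Location
0,John,85,20,Juja
1,Reagan,90,21,Utawala
2,Nathan,78,19,Uthiru
3,Abdullahi,88,22,Kilimani
4,Bethany,92,20,Membley
5,Naomi,79,20,Ngong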

### 4. `df.join()` – Add Columns by Index
`df.join()` is a pandas method that combines DataFrames based on their indexes rather than column values. It is essentially a convenience wrapper around `pd.merge()` for index-based joins, with different defaults (for example, `how="left"` instead of `how="inner"`).

Key Characteristics
- Index-Based Alignment: Joins DataFrames using their indexes (row labels)
- Column Concatenation: Horizontally combines columns from multiple DataFrames
- SQL-Like Joins: Supports left (default), right, inner, and outer join operations
- Convenience Method: Simplified syntax compared to merge() for index-based operations

In [11]:
extra_info = pd.DataFrame({
    "Gender": ["M", "M", "M", "M", "F"]
}, index=[0, 1, 2, 3, 4])

joined = data.join(extra_info)
joined


Unnamed: 0,Name,Score,Age,Location,Gender
0,John,85,20,Juja,M
1,Reagan,90,21,Utawala,M
2,Nathan,78,19,Uthiru,M
3,Abdullahi,88,22,Kilimani,M
4,Bethany,92,20,Membley,F


Use this when rows are aligned by index, not keys.
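
When the indexes do not match exactly, the `how` argument decides which rows survive. A minimal sketch with names as the index (the frames `scores` and `genders` are assumptions for illustration):

```python
import pandas as pd

scores = pd.DataFrame({"Score": [85, 90, 78]},
                      index=["John", "Reagan", "Nathan"])
genders = pd.DataFrame({"Gender": ["M", "M"]}, index=["John", "Ryan"])

left_joined = scores.join(genders)                # how="left" is the default
inner_joined = scores.join(genders, how="inner")  # only labels in both indexes

print(left_joined)
print(inner_joined)
```

The left join keeps all three rows of `scores` and fills `Gender` with NaN for Reagan and Nathan; the inner join keeps only John.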

### Combining Data in Pandas – Official Documentation

### Reference Links

- [`concat()` docs](https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.concat.html) – Add rows or columns.
- [`merge()` docs](https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.merge.html) – Combine DataFrames using a key (like SQL JOIN).
- [`join()` docs](https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.DataFrame.join.html) – Add columns by index.

---

### Summary: When to Use What?

| **Method** | **Use When...**                                           | **Aligns By**     |
|------------|-----------------------------------------------------------|-------------------|
| [`concat()`](https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.concat.html) | Adding rows or columns to a DataFrame               | Axis (row/col)     |
| [`merge()`](https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.merge.html)  | Merging datasets based on a key (like SQL JOIN)    | Key/Column         |
| [`join()`](https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.DataFrame.join.html)   | Adding columns when indexes align exactly           | Index              |
| `append()` *(Removed in pandas 2.0)* | Formerly added one DataFrame's rows to another; use `concat()` instead | Axis=0 (rows)      |

---
