# Core Data Structures: Series and DataFrame
Pandas introduces two primary data structures:

1. **Series:** A one-dimensional labeled array capable of holding any data type (integers, strings, floats, Python objects, etc.).

    Think of it as a single column in a spreadsheet, where each row has an index label.

2. **DataFrame:** A two-dimensional labeled data structure whose columns can have different types. Each column is a Series, and all columns share the same index. It's the most commonly used Pandas object.

    Think of it as the entire spreadsheet or a SQL table, with multiple columns and rows.
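
To make the relationship between the two structures concrete, here is a minimal sketch. The column names and values are made up for illustration; it only checks that selecting one column of a DataFrame yields a Series that shares the DataFrame's index.

```python
import pandas as pd

# Hypothetical data, used only for illustration
frame = pd.DataFrame({"name": ["Ann", "Ben"], "score": [90, 85]})

# Selecting a single column returns a Series
score = frame["score"]
print(type(score).__name__)             # Series
print(score.index.equals(frame.index))  # True: the column shares the DataFrame's index
```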

**Getting Started: Installation and Import**

First, you need to install Pandas if you haven't already:

bash: `pip install pandas`

Then, in your Python script or Jupyter Notebook, you typically import it like this:

In [1]:
import warnings
import pandas as pd
import numpy as np
warnings.filterwarnings('ignore')

**Creating Data Structures**

**Creating a Series**

You can create a Series from a list, a NumPy array, or a dictionary.

In [2]:
# From a list
s = pd.Series([1, 3, 5, np.nan, 6, 8])
print("Series from a list:")
print(s)

# From a NumPy array
arr = np.array([10, 20, 30, 40])
s_arr = pd.Series(arr)
print("\nSeries from a NumPy array:")
print(s_arr)

# From a dictionary (keys become the index)
data = {'a': 100, 'b': 200, 'c': 300}
s_dict = pd.Series(data)
print("\nSeries from a dictionary:")
print(s_dict)

Series from a list:
0    1.0
1    3.0
2    5.0
3    NaN
4    6.0
5    8.0
dtype: float64

Series from a NumPy array:
0    10
1    20
2    30
3    40
dtype: int64

Series from a dictionary:
a    100
b    200
c    300
dtype: int64
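
A dictionary is not the only way to get custom labels. As a small sketch (the labels and prices here are hypothetical), the `index` argument of `pd.Series` attaches labels to a plain list, after which values can be looked up either by label or by position:

```python
import pandas as pd

# Hypothetical labels and values, for illustration only
prices = pd.Series([1.5, 2.0, 0.75], index=["apple", "pear", "plum"])

print(prices["pear"])      # look up by label
print(prices.iloc[0])      # look up by position
print(list(prices.index))  # ['apple', 'pear', 'plum']
```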


**Creating a DataFrame**

DataFrames can be created in several ways, most commonly from dictionaries of Series/lists or NumPy arrays.

In [3]:
# From a dictionary of lists/Series
data = {
    'Name': ['Alice', 'Bob', 'Charlie', 'David'],
    'Age': [20, 30, 35, 40],
    'City': ['New York', 'Los Angeles', 'Chicago', 'Houston']
}
df = pd.DataFrame(data)
print("\nDataFrame from a dictionary:")
print(df)

# Specifying index and columns
df_indexed = pd.DataFrame(data, index=['a', 'b', 'c', 'd'], columns=['Age', 'Name', 'City'])
print("\nDataFrame with custom index and columns:")
print(df_indexed)

# From a list of dictionaries (each dictionary is a row)
data_list_of_dict = [
    {'Name': 'Eve', 'Age': 22, 'City': 'Miami'},
    {'Name': 'Frank', 'Age': 28, 'City': 'Boston'}
]
df_from_list_of_dict = pd.DataFrame(data_list_of_dict)
print("\nDataFrame from a list of dictionaries:")
print(df_from_list_of_dict)


DataFrame from a dictionary:
      Name  Age         City
0    Alice   20     New York
1      Bob   30  Los Angeles
2  Charlie   35      Chicago
3    David   40      Houston

DataFrame with custom index and columns:
   Age     Name         City
a   20    Alice     New York
b   30      Bob  Los Angeles
c   35  Charlie      Chicago
d   40    David      Houston

DataFrame from a list of dictionaries:
    Name  Age    City
0    Eve   22   Miami
1  Frank   28  Boston
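
When building from a list of dictionaries, the rows do not need identical keys. The sketch below (with made-up rows) suggests what to expect: the columns are the union of all keys, and a key absent from a row shows up as a missing value.

```python
import pandas as pd

# Hypothetical rows; the second one has no 'City' key
rows = [
    {"Name": "Eve", "Age": 22, "City": "Miami"},
    {"Name": "Frank", "Age": 28},
]
df_rows = pd.DataFrame(rows)

print(df_rows)
print(df_rows["City"].isna().tolist())  # [False, True]
```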


**Viewing Data**

Once you have a DataFrame, you'll want to inspect it.

In [4]:
# Create a sample DataFrame for demonstration
data = {
    'A': [1, 2, 3, 4, 5, 6, 7, 8, 9, 10],
    'B': [11, 12, 13, 14, 15, 16, 17, 18, 19, 20],
    'C': ['x', 'y', 'z', 'x', 'y', 'z', 'x', 'y', 'z', 'x']
}
df = pd.DataFrame(data)

# Display the first 5 rows
print("df.head():")
print(df.head())

# Display the last 3 rows
print("\ndf.tail(3):")
print(df.tail(3))

# Get a concise summary of the DataFrame
# (info() prints the summary itself and returns None, so don't wrap it in print())
print("\ndf.info():")
df.info()

# Get descriptive statistics
print("\ndf.describe():")
print(df.describe())

# Get the shape (number of rows, number of columns)
print("\ndf.shape:")
print(df.shape)

# Get the column names
print("\ndf.columns:")
print(df.columns)

# Get the index
print("\ndf.index:")
print(df.index)

df.head():
   A   B  C
0  1  11  x
1  2  12  y
2  3  13  z
3  4  14  x
4  5  15  y

df.tail(3):
    A   B  C
7   8  18  y
8   9  19  z
9  10  20  x

df.info():
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 10 entries, 0 to 9
Data columns (total 3 columns):
 #   Column  Non-Null Count  Dtype 
---  ------  --------------  ----- 
 0   A       10 non-null     int64 
 1   B       10 non-null     int64 
 2   C       10 non-null     object
dtypes: int64(2), object(1)
memory usage: 372.0+ bytes

df.describe():
              A         B
count  10.00000  10.00000
mean    5.50000  15.50000
std     3.02765   3.02765
min     1.00000  11.00000
25%     3.25000  13.25000
50%     5.50000  15.50000
75%     7.75000  17.75000
max    10.00000  20.00000

df.shape:
(10, 3)

df.columns:
Index(['A', 'B', 'C'], dtype='object')

df.index:
RangeIndex(start=0, stop=10, step=1)
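
Note that `describe()` summarized only the numeric columns above. A minimal sketch of one way to include text columns as well, using the `include="all"` option on a cut-down copy of the sample data (the name `df_view` is just for this example):

```python
import pandas as pd

df_view = pd.DataFrame({
    "A": [1, 2, 3, 4, 5, 6, 7, 8, 9, 10],
    "C": ["x", "y", "z", "x", "y", "z", "x", "y", "z", "x"],
})

# Per-column data types
print(df_view.dtypes)

# include="all" adds count/unique/top/freq rows for non-numeric columns
summary = df_view.describe(include="all")
print(summary.loc[["count", "unique", "top", "freq"], "C"])
```

For column `C` this reports 3 unique values, with `x` the most frequent (4 occurrences).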


**Selection and Indexing**

Pandas offers several ways to select columns, rows, and individual values.

**Selecting a Single Column**

In [5]:
print("Selecting column 'A':")
print(df['A'])  # Returns a Series

print("\nSelecting column 'B' (alternative method):")
print(df.B)  # Works if the column name is a valid Python identifier

Selecting column 'A':
0     1
1     2
2     3
3     4
4     5
5     6
6     7
7     8
8     9
9    10
Name: A, dtype: int64

Selecting column 'B' (alternative method):
0    11
1    12
2    13
3    14
4    15
5    16
6    17
7    18
8    19
9    20
Name: B, dtype: int64
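
Attribute access has limits worth seeing once. In the sketch below (column names chosen deliberately to cause trouble), a column called `count` is shadowed by the `DataFrame.count` method, and a name containing a space cannot be written as an attribute at all, so bracket access is the safer default:

```python
import pandas as pd

# Hypothetical column names picked to show the pitfalls
df_names = pd.DataFrame({"count": [1, 2], "unit price": [3.0, 4.5]})

print(callable(df_names.count))         # True: this is the method, not the column
print(df_names["count"].tolist())       # [1, 2]
print(df_names["unit price"].tolist())  # [3.0, 4.5]
```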


**Selecting Multiple Columns**

In [6]:
print("\nSelecting columns 'A' and 'C':")
print(df[['A', 'C']])  # Returns a DataFrame


Selecting columns 'A' and 'C':
    A  C
0   1  x
1   2  y
2   3  z
3   4  x
4   5  y
5   6  z
6   7  x
7   8  y
8   9  z
9  10  x


**Selecting Rows by Label (`.loc[]`)**

`.loc[]` selects by label. Here the labels are the default integer index, and label slices include both endpoints.

In [7]:
# Select a single row by its index label (e.g., row with index 2)
print("\nSelecting row with index 2 using .loc:")
print(df.loc[2])

# Select multiple rows by index labels
print("\nSelecting rows with indices 0, 2, 4 using .loc:")
print(df.loc[[0, 2, 4]])

# Select a slice of rows by index labels (inclusive of both ends)
print("\nSelecting rows from index 1 to 3 (inclusive) using .loc:")
print(df.loc[1:3])

# Select specific rows and columns by labels
print("\nSelecting rows 0, 1 and columns 'A' and 'C' using .loc:")
print(df.loc[[0, 1], ['A', 'C']])


Selecting row with index 2 using .loc:
A     3
B    13
C     z
Name: 2, dtype: object

Selecting rows with indices 0, 2, 4 using .loc:
   A   B  C
0  1  11  x
2  3  13  z
4  5  15  y

Selecting rows from index 1 to 3 (inclusive) using .loc:
   A   B  C
1  2  12  y
2  3  13  z
3  4  14  x

Selecting rows 0, 1 and columns 'A' and 'C' using .loc:
   A  C
0  1  x
1  2  y


**Selecting Rows by Position (`.iloc[]`)**

`.iloc[]` selects by integer position, starting at 0. Position slices exclude the end, as with Python lists.

In [8]:
# Select a single row by its integer position (e.g., the 3rd row, which has index 2)
print("\nSelecting row at position 2 using .iloc:")
print(df.iloc[2])

# Select multiple rows by integer positions
print("\nSelecting rows at positions 0, 2, 4 using .iloc:")
print(df.iloc[[0, 2, 4]])

# Select a slice of rows by integer positions (exclusive of the end)
print("\nSelecting rows from position 1 up to (but not including) 4 using .iloc:")
print(df.iloc[1:4])

# Select specific rows and columns by integer positions
print("\nSelecting rows at positions 0, 1 and columns at positions 0, 2 using .iloc:")
print(df.iloc[[0, 1], [0, 2]])


Selecting row at position 2 using .iloc:
A     3
B    13
C     z
Name: 2, dtype: object

Selecting rows at positions 0, 2, 4 using .iloc:
   A   B  C
0  1  11  x
2  3  13  z
4  5  15  y

Selecting rows from position 1 up to (but not including) 4 using .iloc:
   A   B  C
1  2  12  y
2  3  13  z
3  4  14  x

Selecting rows at positions 0, 1 and columns at positions 0, 2 using .iloc:
   A  C
0  1  x
1  2  y
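
With the default index, labels and positions coincide, which hides the difference between `.loc[]` and `.iloc[]`. The following sketch uses a hypothetical DataFrame with index labels 10, 20, 30 to separate the two:

```python
import pandas as pd

# Hypothetical data whose labels differ from positions
df_lbl = pd.DataFrame({"A": [1, 2, 3]}, index=[10, 20, 30])

print(df_lbl.loc[10, "A"])     # label 10    -> 1
print(df_lbl.iloc[0, 0])       # position 0  -> 1
print(len(df_lbl.loc[10:20]))  # 2: label slices include both ends
print(len(df_lbl.iloc[0:1]))   # 1: position slices exclude the end
```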


**Boolean Indexing (Filtering)**

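Boolean indexing keeps the rows for which a condition evaluates to `True`. Combine conditions with `&` (and), `|` (or), and `~` (not), and wrap each condition in parentheses, because these operators bind more tightly than comparisons such as `>`.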

In [9]:
# Select rows where column 'A' is greater than 5
print("\nRows where 'A' > 5:")
print(df[df['A'] > 5])

# Select rows where column 'C' is 'x'
print("\nRows where 'C' is 'x':")
print(df[df['C'] == 'x'])

# Combine multiple conditions (each condition needs its own parentheses)
print("\nRows where 'A' > 5 and 'C' is 'x':")
print(df[(df['A'] > 5) & (df['C'] == 'x')])

print("\nRows where 'A' < 3 or 'B' > 18:")
print(df[(df['A'] < 3) | (df['B'] > 18)])

# Using .isin() for multiple values
print("\nRows where 'C' is 'x' or 'y':")
print(df[df['C'].isin(['x', 'y'])])


Rows where 'A' > 5:
    A   B  C
5   6  16  z
6   7  17  x
7   8  18  y
8   9  19  z
9  10  20  x

Rows where 'C' is 'x':
    A   B  C
0   1  11  x
3   4  14  x
6   7  17  x
9  10  20  x

Rows where 'A' > 5 and 'C' is 'x':
    A   B  C
6   7  17  x
9  10  20  x

Rows where 'A' < 3 or 'B' > 18:
    A   B  C
0   1  11  x
1   2  12  y
8   9  19  z
9  10  20  x

Rows where 'C' is 'x' or 'y':
    A   B  C
0   1  11  x
1   2  12  y
3   4  14  x
4   5  15  y
6   7  17  x
7   8  18  y
9  10  20  x
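
As a compact recap of combining conditions, the sketch below rebuilds part of the sample data under a new name (`df_f`, chosen for this example) and shows three things: a parenthesized `&` filter, the same filter written with `DataFrame.query`, and negation with `~`. Treat the `query` form as an optional alternative rather than the recommended style.

```python
import pandas as pd

df_f = pd.DataFrame({
    "A": [1, 2, 3, 4, 5, 6, 7, 8, 9, 10],
    "C": ["x", "y", "z", "x", "y", "z", "x", "y", "z", "x"],
})

# Each condition is parenthesized before combining with &
both = df_f[(df_f["A"] > 5) & (df_f["C"] == "x")]

# The same filter expressed as a query string
same = df_f.query("A > 5 and C == 'x'")

# ~ negates a boolean mask: rows where C is not 'x'
not_x = df_f[~(df_f["C"] == "x")]

print(both.index.tolist())  # [6, 9]
print(same.index.tolist())  # [6, 9]
print(len(not_x))           # 6
```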


**Handling Missing Data**

Pandas usually represents missing data as `NaN` (Not a Number). `None` and `NaT` (for datetimes) are also treated as missing.

In [10]:
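# NaN is a float; cast the integer columns to float first, because newer pandas
# versions warn when an assignment would silently upcast an integer column
df_missing = df.astype({'A': 'float64', 'B': 'float64'})  # astype returns a new DataFrame
df_missing.iloc[1, 0] = np.nan  # set A[1] to NaN
df_missing.iloc[4, 1] = np.nan  # set B[4] to NaN
df_missing.iloc[6, 2] = np.nan  # set C[6] to NaN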

print("\nDataFrame with missing values:")
print(df_missing)

# Check for missing values
print("\nChecking for missing values:")
print(df_missing.isnull().sum())

# Drop rows with any missing values
print("\nDataFrame after dropping rows with any missing values (df_missing.dropna()):")
print(df_missing.dropna())

# Fill missing values with a specific value (e.g., 0)
print("\nDataFrame after filling missing values with 0 (df_missing.fillna(0)):")
print(df_missing.fillna(0))

# Fill missing values with the mean of the column
print("\nDataFrame after filling NaN in 'A' with its mean:")
print(df_missing['A'].fillna(df_missing['A'].mean()))


DataFrame with missing values:
      A     B    C
0   1.0  11.0    x
1   NaN  12.0    y
2   3.0  13.0    z
3   4.0  14.0    x
4   5.0   NaN    y
5   6.0  16.0    z
6   7.0  17.0  NaN
7   8.0  18.0    y
8   9.0  19.0    z
9  10.0  20.0    x

Checking for missing values:
A    1
B    1
C    1
dtype: int64

DataFrame after dropping rows with any missing values (df_missing.dropna()):
      A     B  C
0   1.0  11.0  x
2   3.0  13.0  z
3   4.0  14.0  x
5   6.0  16.0  z
7   8.0  18.0  y
8   9.0  19.0  z
9  10.0  20.0  x

DataFrame after filling missing values with 0 (df_missing.fillna(0)):
      A     B  C
0   1.0  11.0  x
1   0.0  12.0  y
2   3.0  13.0  z
3   4.0  14.0  x
4   5.0   0.0  y
5   6.0  16.0  z
6   7.0  17.0  0
7   8.0  18.0  y
8   9.0  19.0  z
9  10.0  20.0  x

DataFrame after filling NaN in 'A' with its mean:
0     1.000000
1     5.888889
2     3.000000
3     4.000000
4     5.000000
5     6.000000
6     7.000000
7     8.000000
8     9.000000
9    10.000000
Name: A, dtype: float64
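
Filling every column with the same value is rarely what you want when columns have different types. A minimal sketch of two finer-grained options, on a small hypothetical DataFrame: `fillna` with a per-column dictionary, and `dropna` restricted to chosen columns via `subset`.

```python
import numpy as np
import pandas as pd

# Hypothetical data with one gap in each column
df_na = pd.DataFrame({
    "A": [1.0, np.nan, 3.0],
    "C": ["x", "y", None],
})

# Fill each column with its own replacement value
filled = df_na.fillna({"A": df_na["A"].mean(), "C": "unknown"})

# Drop rows only when 'A' is missing; gaps in 'C' are tolerated
kept = df_na.dropna(subset=["A"])

print(filled)
print(kept.index.tolist())  # [0, 2]
```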

**Operations**

Arithmetic on columns is vectorized: an operation applies element-wise across all rows without an explicit loop.

**Basic Arithmetic Operations**

In [11]:
df_ops = df.copy()
df_ops['D'] = df_ops['A'] + df_ops['B'] # Add two columns
print("\nDataFrame with new column 'D' (A + B):")
print(df_ops)

df_ops['E'] = df_ops['A'] * 2 # Multiply a column by a scalar
print("\nDataFrame with new column 'E' (A * 2):")
print(df_ops)


DataFrame with new column 'D' (A + B):
    A   B  C   D
0   1  11  x  12
1   2  12  y  14
2   3  13  z  16
3   4  14  x  18
4   5  15  y  20
5   6  16  z  22
6   7  17  x  24
7   8  18  y  26
8   9  19  z  28
9  10  20  x  30

DataFrame with new column 'E' (A * 2):
    A   B  C   D   E
0   1  11  x  12   2
1   2  12  y  14   4
2   3  13  z  16   6
3   4  14  x  18   8
4   5  15  y  20  10
5   6  16  z  22  12
6   7  17  x  24  14
7   8  18  y  26  16
8   9  19  z  28  18
9  10  20  x  30  20
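
Adding `A` and `B` works row by row because both columns share one index. More generally, pandas aligns operands by index label, not by position. The sketch below uses two hypothetical Series with different labels to show the effect, and `Series.add` with `fill_value` as one way to avoid the resulting `NaN`:

```python
import pandas as pd

s1 = pd.Series([1, 2, 3], index=["a", "b", "c"])
s2 = pd.Series([10, 20], index=["b", "c"])

# Labels are matched first; 'a' has no partner in s2, so the result there is NaN
total = s1 + s2
print(total)

# Treat a missing partner as 0 instead
filled_total = s1.add(s2, fill_value=0)
print(filled_total.tolist())  # [1.0, 12.0, 23.0]
```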


**Aggregations**

In [12]:
# Mean of column 'A'
print("\nMean of column 'A':", df['A'].mean())

# Sum of column 'B'
print("Sum of column 'B':", df['B'].sum())

# Max value of column 'A'
print("Max of column 'A':", df['A'].max())

# Count of non-null values in column 'A'
print("Count of non-null in 'A':", df['A'].count())


Mean of column 'A': 5.5
Sum of column 'B': 155
Max of column 'A': 10
Count of non-null in 'A': 10
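
Several aggregations can also be computed in one call. As a sketch, `DataFrame.agg` accepts a list of function names and returns one row per function (the name `df_agg` and the reduced two-column data are just for this example):

```python
import pandas as pd

df_agg = pd.DataFrame({"A": list(range(1, 11)), "B": list(range(11, 21))})

# One row per aggregation, one column per original column
stats = df_agg.agg(["mean", "sum", "max"])
print(stats)
```

The `sum` row for `B` and the `mean` row for `A` match the single-column results above (155 and 5.5).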


**Grouping Data (`.groupby()`)**

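`.groupby()` splits the rows into groups by key, applies a function to each group, and combines the results, much like SQL's `GROUP BY`.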

In [13]:
# Group by 'C' and calculate the mean of 'A' and 'B' for each group
print("\nGroup by 'C' and calculate mean of 'A' and 'B':")
print(df.groupby('C').mean())

# Count the occurrences of each unique value in 'C'
print("\nValue counts for column 'C':")
print(df['C'].value_counts())


Group by 'C' and calculate mean of 'A' and 'B':
     A     B
C           
x  5.5  15.5
y  5.0  15.0
z  6.0  16.0

Value counts for column 'C':
C
x    4
y    3
z    3
Name: count, dtype: int64
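
Different columns often need different aggregations per group. One way to express that is named aggregation, sketched here on a copy of the sample data; the output column names `a_mean` and `b_sum` are arbitrary choices for this example.

```python
import pandas as pd

df_g = pd.DataFrame({
    "A": [1, 2, 3, 4, 5, 6, 7, 8, 9, 10],
    "B": [11, 12, 13, 14, 15, 16, 17, 18, 19, 20],
    "C": ["x", "y", "z", "x", "y", "z", "x", "y", "z", "x"],
})

# Each keyword is an output column: (source column, aggregation)
per_group = df_g.groupby("C").agg(
    a_mean=("A", "mean"),
    b_sum=("B", "sum"),
)
print(per_group)
```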


**Input/Output (I/O)**

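Pandas can read and write data in many formats, including CSV and Excel. The Excel calls below are left commented out; running them requires an optional engine such as `openpyxl`.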

In [14]:
# To CSV
df.to_csv('my_data.csv', index=False) # index=False prevents writing the DataFrame index as a column
print("\nDataFrame saved to 'my_data.csv'")

# From CSV
df_from_csv = pd.read_csv('my_data.csv')
print("\nDataFrame read from 'my_data.csv':")
print(df_from_csv.head())

# To Excel
# df.to_excel('my_data.xlsx', sheet_name='Sheet1', index=False)
# print("\nDataFrame saved to 'my_data.xlsx'")

# From Excel
# df_from_excel = pd.read_excel('my_data.xlsx', sheet_name='Sheet1')
# print("\nDataFrame read from 'my_data.xlsx':")
# print(df_from_excel.head())


DataFrame saved to 'my_data.csv'

DataFrame read from 'my_data.csv':
   A   B  C
0  1  11  x
1  2  12  y
2  3  13  z
3  4  14  x
4  5  15  y
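
To see why `index=False` matters, here is a small sketch that round-trips a hypothetical DataFrame through CSV text in memory (using `io.StringIO` instead of a file). Writing the index and then reading it back naively produces an extra `Unnamed: 0` column, which `index_col=0` avoids.

```python
import io

import pandas as pd

df_io = pd.DataFrame({"A": [1, 2], "C": ["x", "y"]})

# With no path, to_csv returns the CSV as a string (index included by default)
csv_text = df_io.to_csv()

# Reading it back naively turns the old index into a data column
naive = pd.read_csv(io.StringIO(csv_text))

# index_col=0 restores the first column as the index
restored = pd.read_csv(io.StringIO(csv_text), index_col=0)

print(naive.columns.tolist())     # ['Unnamed: 0', 'A', 'C']
print(restored.columns.tolist())  # ['A', 'C']
```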
