## Pandas 

Pandas is an open-source Python library, built on top of NumPy, that provides fast, easy-to-use data structures and data analysis tools. It is well suited to structured (tabular) data and integrates with other Python libraries such as NumPy, Matplotlib, and scikit-learn.

In [None]:
! pip install pandas numpy matplotlib seaborn

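```python
import pandas as pd

# Build a DataFrame from a dictionary: each key becomes a column name
# and each list becomes that column's values.
data = {
    'Name': ['Alice', 'Bob', 'Charlie'],
    'Age': [25, 30, 35],
    'City': ['New York', 'Los Angeles', 'Chicago']
}

df = pd.DataFrame(data)
```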

### Importing Pandas 

In [7]:
import pandas as pd 
# import numpy as np

### Reading Data in Pandas

In [10]:
df = pd.read_csv("./data/WineQT.csv")

In [11]:
df.head()

,fixed acidity,volatile acidity,citric acid,residual sugar,chlorides,free sulfur dioxide,total sulfur dioxide,density,pH,sulphates,alcohol,quality,Id
0,7.4,0.7,0.0,1.9,0.076,11.0,34.0,0.9978,3.51,0.56,9.4,5,0
1,7.8,0.88,0.0,2.6,0.098,25.0,67.0,0.9968,3.2,0.68,9.8,5,1
2,7.8,0.76,0.04,2.3,0.092,15.0,54.0,0.997,3.26,0.65,9.8,5,2
3,11.2,0.28,0.56,1.9,0.075,17.0,60.0,0.998,3.16,0.58,9.8,6,3
4,7.4,0.7,0.0,1.9,0.076,11.0,34.0,0.9978,3.51,0.56,9.4,5,4
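
The call above needs `./data/WineQT.csv` on disk. As a minimal sketch that runs without the file, the example below reads a few rows from an in-memory string instead. The sample rows are copied from the output above; the variable names `csv_text`, `sample`, and `subset` are only illustrative.

```python
import io

import pandas as pd

# Three rows copied from the df.head() output, held in memory so the
# example runs without ./data/WineQT.csv.
csv_text = """fixed acidity,volatile acidity,pH,quality,Id
7.4,0.7,3.51,5,0
7.8,0.88,3.2,5,1
7.8,0.76,3.26,5,2
"""

# read_csv accepts a file path or any file-like object.
sample = pd.read_csv(io.StringIO(csv_text))

# usecols limits which columns are parsed; nrows limits how many rows.
subset = pd.read_csv(io.StringIO(csv_text), usecols=["pH", "quality"], nrows=2)

print(sample.shape)
print(subset)
```

Restricting columns and rows this way can be handy for a quick look at a large file before loading all of it.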


### Accessing a Single Column

In [14]:
df["sulphates"]

0       0.56
1       0.68
2       0.65
3       0.58
4       0.56
        ... 
1138    0.75
1139    0.82
1140    0.58
1141    0.76
1142    0.71
Name: sulphates, Length: 1143, dtype: float64

In [15]:
df.sulphates

0       0.56
1       0.68
2       0.65
3       0.58
4       0.56
        ... 
1138    0.75
1139    0.82
1140    0.58
1141    0.76
1142    0.71
Name: sulphates, Length: 1143, dtype: float64
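
Bracket and dot notation return the same Series here, but they are not always interchangeable. The sketch below uses a small hand-made frame (the name `wine` is an assumption, not the notebook's `df`) to show that a column name containing a space can only be reached with brackets.

```python
import pandas as pd

wine = pd.DataFrame({
    "sulphates": [0.56, 0.68, 0.65],
    "fixed acidity": [7.4, 7.8, 7.8],
})

# For a simple column name, both notations return the same Series.
same = wine["sulphates"].equals(wine.sulphates)

# A name with a space is not a valid Python attribute name,
# so brackets are the only option.
acidity = wine["fixed acidity"]

print(same)
print(acidity.tolist())
```

Dot notation also fails when a column name collides with an existing DataFrame attribute or method (for example a column called `count`), so brackets are the safer default.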

### Accessing a Single Row

In [16]:
df.iloc[2]

fixed acidity            7.800
volatile acidity         0.760
citric acid              0.040
residual sugar           2.300
chlorides                0.092
free sulfur dioxide     15.000
total sulfur dioxide    54.000
density                  0.997
pH                       3.260
sulphates                0.650
alcohol                  9.800
quality                  5.000
Id                       2.000
Name: 2, dtype: float64

### Filtering Data in Pandas

In [20]:
df[df["sulphates"] <= 0.5]

,fixed acidity,volatile acidity,citric acid,residual sugar,chlorides,free sulfur dioxide,total sulfur dioxide,density,pH,sulphates,alcohol,quality,Id
6,7.9,0.600,0.06,1.60,0.069,15.0,59.0,0.99640,3.30,0.46,9.4,5,6
7,7.3,0.650,0.00,1.20,0.065,15.0,21.0,0.99460,3.39,0.47,10.0,7,7
45,7.2,0.725,0.05,4.65,0.086,4.0,11.0,0.99620,3.41,0.39,10.9,5,64
46,7.2,0.725,0.05,4.65,0.086,4.0,11.0,0.99620,3.41,0.39,10.9,5,65
49,7.7,0.690,0.22,1.90,0.084,18.0,94.0,0.99610,3.31,0.48,9.5,5,72
...,...,...,...,...,...,...,...,...,...,...,...,...,...
1012,7.8,0.530,0.01,1.60,0.077,3.0,19.0,0.99500,3.16,0.46,9.8,5,1420
1025,7.0,0.590,0.00,1.70,0.052,3.0,8.0,0.99600,3.41,0.47,10.3,5,1438
1027,6.9,0.580,0.20,1.75,0.058,8.0,22.0,0.99322,3.38,0.49,11.7,5,1443
1047,10.0,0.690,0.11,1.40,0.084,8.0,24.0,0.99578,2.88,0.47,9.7,5,1470
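
The filter above works in two steps, which can be sketched on a small hand-made frame (the names `wine`, `mask`, and `low_sulphates` are illustrative): the comparison builds a boolean Series, and indexing with that Series keeps only the rows marked `True`.

```python
import pandas as pd

wine = pd.DataFrame({
    "sulphates": [0.56, 0.46, 0.47, 0.68],
    "alcohol": [9.4, 9.4, 10.0, 9.8],
})

# Step 1: the comparison yields one boolean per row.
mask = wine["sulphates"] <= 0.5

# Step 2: indexing with the mask keeps the rows where it is True.
low_sulphates = wine[mask]

print(mask.tolist())                 # [False, True, True, False]
print(low_sulphates.index.tolist())  # [1, 2]
```

The surviving rows keep their original index labels, which is why the filtered output above starts at row 6 rather than 0.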


### Filtering with Multiple Conditions

In [22]:
df[(df["sulphates"] <= 0.5 ) & (df["alcohol"] == 9)]

,fixed acidity,volatile acidity,citric acid,residual sugar,chlorides,free sulfur dioxide,total sulfur dioxide,density,pH,sulphates,alcohol,quality,Id
118,7.3,0.55,0.03,1.6,0.072,17.0,42.0,0.9956,3.37,0.48,9.0,4,167
532,8.2,0.34,0.38,2.5,0.08,12.0,57.0,0.9978,3.3,0.47,9.0,6,746
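
Conditions are combined with `&` (and), `|` (or), and `~` (not), and each condition needs its own parentheses because these operators bind more tightly than comparisons. The sketch below uses a hand-made frame with illustrative names. It also swaps the exact `== 9` test for `np.isclose`, which is one way to guard against floating-point rounding; that substitution is a suggestion, not what the notebook cell does.

```python
import numpy as np
import pandas as pd

wine = pd.DataFrame({
    "sulphates": [0.48, 0.47, 0.56, 0.39],
    "alcohol": [9.0, 9.0, 9.4, 10.9],
})

low = wine["sulphates"] <= 0.5
# np.isclose tolerates tiny rounding differences that == would reject.
nine = np.isclose(wine["alcohol"], 9.0)

both = wine[low & nine]        # AND: low sulphates and alcohol of 9
either = wine[low | nine]      # OR: at least one condition holds
neither = wine[~(low | nine)]  # NOT: no condition holds

print(both.index.tolist())
print(either.index.tolist())
print(neither.index.tolist())
```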


### Adding and Modifying Data in Pandas

In [25]:
df["Total Acidity"] = (df["fixed acidity"] + df["volatile acidity"])

In [26]:
df.head()

,fixed acidity,volatile acidity,citric acid,residual sugar,chlorides,free sulfur dioxide,total sulfur dioxide,density,pH,sulphates,alcohol,quality,Id,Total Acidity
0,7.4,0.7,0.0,1.9,0.076,11.0,34.0,0.9978,3.51,0.56,9.4,5,0,8.1
1,7.8,0.88,0.0,2.6,0.098,25.0,67.0,0.9968,3.2,0.68,9.8,5,1,8.68
2,7.8,0.76,0.04,2.3,0.092,15.0,54.0,0.997,3.26,0.65,9.8,5,2,8.56
3,11.2,0.28,0.56,1.9,0.075,17.0,60.0,0.998,3.16,0.58,9.8,6,3,11.48
4,7.4,0.7,0.0,1.9,0.076,11.0,34.0,0.9978,3.51,0.56,9.4,5,4,8.1
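
The cell above adds a column. Modifying existing values follows a similar pattern, sketched below on a small hand-made frame. The `label` and `is_good` columns and the quality threshold of 6 are invented for illustration only.

```python
import pandas as pd

wine = pd.DataFrame({
    "fixed acidity": [7.4, 7.8, 11.2],
    "volatile acidity": [0.70, 0.88, 0.28],
    "quality": [5, 5, 6],
})

# Add a column from an element-wise sum of two existing columns.
wine["Total Acidity"] = wine["fixed acidity"] + wine["volatile acidity"]

# Add a column with one constant value, then overwrite only the rows
# that match a condition using .loc[row_mask, column].
wine["label"] = "average"
wine.loc[wine["quality"] >= 6, "label"] = "good"

# assign() returns a new DataFrame and leaves wine itself unchanged.
flagged = wine.assign(is_good=wine["quality"] >= 6)

print(wine)
print(flagged.columns.tolist())
```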


### Grouping and Aggregating Data

In [31]:
df.groupby("chlorides")["density"].count()

chlorides
0.012    2
0.034    1
0.038    2
0.039    3
0.041    4
        ..
0.415    3
0.422    1
0.467    1
0.610    1
0.611    1
Name: density, Length: 131, dtype: int64
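
Grouping on a continuous measurement such as `chlorides` produces many tiny groups (131 above). Grouping is often more informative on a column with few distinct values. The sketch below groups a small hand-made frame by `quality` and computes several aggregates at once with `agg`; the data values are made up for illustration.

```python
import pandas as pd

wine = pd.DataFrame({
    "quality": [5, 5, 6, 6, 7],
    "alcohol": [9.4, 9.8, 9.8, 10.5, 10.0],
})

# Number of rows in each quality group.
counts = wine.groupby("quality")["alcohol"].count()

# Several aggregates at once: one row per group, one column per statistic.
summary = wine.groupby("quality")["alcohol"].agg(["count", "mean", "max"])

print(counts)
print(summary)
```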

### Sorting Data

In [33]:
df.sort_values("fixed acidity")

,fixed acidity,volatile acidity,citric acid,residual sugar,chlorides,free sulfur dioxide,total sulfur dioxide,density,pH,sulphates,alcohol,quality,Id,Total Acidity
32,4.6,0.52,0.15,2.1,0.054,8.0,65.0,0.9934,3.9,0.56,13.1,4,45,5.12
589,4.9,0.42,0.0,2.1,0.048,16.0,42.0,0.99154,3.71,0.74,14.0,7,821,5.32
396,5.0,1.04,0.24,1.6,0.05,32.0,96.0,0.9934,3.74,0.62,11.5,5,553,6.04
935,5.0,0.74,0.0,1.2,0.041,16.0,46.0,0.99258,4.01,0.59,12.5,6,1321,5.74
898,5.0,0.38,0.01,1.6,0.048,26.0,60.0,0.99084,3.7,0.75,14.0,6,1270,5.38
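
`sort_values` returns a sorted copy and sorts in ascending order by default. The sketch below, on a small hand-made frame with illustrative names, shows descending order, sorting by more than one column, and renumbering the index afterwards.

```python
import pandas as pd

wine = pd.DataFrame({
    "fixed acidity": [7.4, 5.0, 11.2, 5.0],
    "alcohol": [9.4, 11.5, 9.8, 12.5],
})

# Returns a new DataFrame; wine itself keeps its original order.
ascending = wine.sort_values("fixed acidity")

# Largest values first.
descending = wine.sort_values("fixed acidity", ascending=False)

# Sort by acidity, breaking ties by alcohol from high to low.
tie_broken = wine.sort_values(["fixed acidity", "alcohol"], ascending=[True, False])

# reset_index(drop=True) renumbers the rows 0..n-1 after sorting.
renumbered = tie_broken.reset_index(drop=True)

print(tie_broken)
print(renumbered)
```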


### DataFrame

In [52]:
data = {
    'Name': ['Alice', 'Bob', 'Charlie'],
    'Age': [25, 30, 35],
    'City': ['New York', 'Los Angeles', 'Chicago'],
    'town':["Githurai","ruiru","thika"]
}

data

{'Name': ['Alice', 'Bob', 'Charlie'],
 'Age': [25, 30, 35],
 'City': ['New York', 'Los Angeles', 'Chicago'],
 'town': ['Githurai', 'ruiru', 'thika']}

In [53]:
df = pd.DataFrame(data)
df

,Name,Age,City,town
0,Alice,25,New York,Githurai
1,Bob,30,Los Angeles,ruiru
2,Charlie,35,Chicago,thika


### Series

In [54]:
df["Name"]

0      Alice
1        Bob
2    Charlie
Name: Name, dtype: object

In [55]:
df.Age

0    25
1    30
2    35
Name: Age, dtype: int64

In [56]:
df["Age"].mean()

np.float64(30.0)
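
A Series can also be built directly, without a DataFrame. The sketch below creates one with custom labels (the name `ages` and the label-based index are illustrative choices) and shows label lookup, element-wise arithmetic, and boolean filtering.

```python
import pandas as pd

# A Series is a one-dimensional labelled array: values plus an index.
ages = pd.Series([25, 30, 35], index=["Alice", "Bob", "Charlie"], name="Age")

bob_age = ages["Bob"]      # look up a value by its label
average = ages.mean()      # aggregate over all values
older = ages + 1           # arithmetic applies to every element
over_28 = ages[ages > 28]  # boolean filtering works as on a DataFrame

print(bob_age, average)
print(older.tolist())
print(over_28.index.tolist())
```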

#### Selecting Multiple Columns


In [57]:
df[["Name","Age"]]

,Name,Age
0,Alice,25
1,Bob,30
2,Charlie,35


### Exporting Data

In [35]:
df.head().to_csv("./data/cities.csv")
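
By default `to_csv` writes the index as an unnamed first column, and reading that file back produces a column called `Unnamed: 0`. The sketch below demonstrates this round trip and the `index=False` option that avoids it. It writes to a temporary directory rather than `./data/`, and the frame `cities` is a made-up example.

```python
import os
import tempfile

import pandas as pd

cities = pd.DataFrame({"Name": ["Alice", "Bob"], "City": ["New York", "Chicago"]})

with tempfile.TemporaryDirectory() as tmp:
    with_index = os.path.join(tmp, "with_index.csv")
    without_index = os.path.join(tmp, "without_index.csv")

    cities.to_csv(with_index)                   # default: index is written too
    cities.to_csv(without_index, index=False)   # only the data columns

    reread_with = pd.read_csv(with_index)
    reread_without = pd.read_csv(without_index)

print(list(reread_with.columns))     # ['Unnamed: 0', 'Name', 'City']
print(list(reread_without.columns))  # ['Name', 'City']
```

If a file was already written with its index, `pd.read_csv(path, index_col=0)` restores that first column as the index instead of a data column.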

### Reading a Dataset From File

In [59]:
df_wine = pd.read_csv("./data/WineQT.csv")
df_wine.head()

,fixed acidity,volatile acidity,citric acid,residual sugar,chlorides,free sulfur dioxide,total sulfur dioxide,density,pH,sulphates,alcohol,quality,Id
0,7.4,0.7,0.0,1.9,0.076,11.0,34.0,0.9978,3.51,0.56,9.4,5,0
1,7.8,0.88,0.0,2.6,0.098,25.0,67.0,0.9968,3.2,0.68,9.8,5,1
2,7.8,0.76,0.04,2.3,0.092,15.0,54.0,0.997,3.26,0.65,9.8,5,2
3,11.2,0.28,0.56,1.9,0.075,17.0,60.0,0.998,3.16,0.58,9.8,6,3
4,7.4,0.7,0.0,1.9,0.076,11.0,34.0,0.9978,3.51,0.56,9.4,5,4


### Basic DataFrame Operations

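Inspecting data:

- `.head()`: returns the first 5 rows by default; pass a number, as in `.head(10)`, for a different count.
- `.tail()`: returns the last 5 rows by default.
- `.info()`: prints a summary of the DataFrame (index range, column names, non-null counts, dtypes, memory usage).
- `.describe()`: returns descriptive statistics (count, mean, std, min, quartiles, max) for the numeric columns.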

In [60]:
df_wine.head()

,fixed acidity,volatile acidity,citric acid,residual sugar,chlorides,free sulfur dioxide,total sulfur dioxide,density,pH,sulphates,alcohol,quality,Id
0,7.4,0.7,0.0,1.9,0.076,11.0,34.0,0.9978,3.51,0.56,9.4,5,0
1,7.8,0.88,0.0,2.6,0.098,25.0,67.0,0.9968,3.2,0.68,9.8,5,1
2,7.8,0.76,0.04,2.3,0.092,15.0,54.0,0.997,3.26,0.65,9.8,5,2
3,11.2,0.28,0.56,1.9,0.075,17.0,60.0,0.998,3.16,0.58,9.8,6,3
4,7.4,0.7,0.0,1.9,0.076,11.0,34.0,0.9978,3.51,0.56,9.4,5,4


In [61]:
df_wine.tail()

,fixed acidity,volatile acidity,citric acid,residual sugar,chlorides,free sulfur dioxide,total sulfur dioxide,density,pH,sulphates,alcohol,quality,Id
1138,6.3,0.51,0.13,2.3,0.076,29.0,40.0,0.99574,3.42,0.75,11.0,6,1592
1139,6.8,0.62,0.08,1.9,0.068,28.0,38.0,0.99651,3.42,0.82,9.5,6,1593
1140,6.2,0.6,0.08,2.0,0.09,32.0,44.0,0.9949,3.45,0.58,10.5,5,1594
1141,5.9,0.55,0.1,2.2,0.062,39.0,51.0,0.99512,3.52,0.76,11.2,6,1595
1142,5.9,0.645,0.12,2.0,0.075,32.0,44.0,0.99547,3.57,0.71,10.2,5,1597


In [62]:
df_wine.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 1143 entries, 0 to 1142
Data columns (total 13 columns):
 #   Column                Non-Null Count  Dtype  
---  ------                --------------  -----  
 0   fixed acidity         1143 non-null   float64
 1   volatile acidity      1143 non-null   float64
 2   citric acid           1143 non-null   float64
 3   residual sugar        1143 non-null   float64
 4   chlorides             1143 non-null   float64
 5   free sulfur dioxide   1143 non-null   float64
 6   total sulfur dioxide  1143 non-null   float64
 7   density               1143 non-null   float64
 8   pH                    1143 non-null   float64
 9   sulphates             1143 non-null   float64
 10  alcohol               1143 non-null   float64
 11  quality               1143 non-null   int64  
 12  Id                    1143 non-null   int64  
dtypes: float64(11), int64(2)
memory usage: 116.2 KB


In [65]:
df_wine.describe()

,fixed acidity,volatile acidity,citric acid,residual sugar,chlorides,free sulfur dioxide,total sulfur dioxide,density,pH,sulphates,alcohol,quality,Id
count,1143.0,1143.0,1143.0,1143.0,1143.0,1143.0,1143.0,1143.0,1143.0,1143.0,1143.0,1143.0,1143.0
mean,8.311111,0.531339,0.268364,2.532152,0.086933,15.615486,45.914698,0.99673,3.311015,0.657708,10.442111,5.657043,804.969379
std,1.747595,0.179633,0.196686,1.355917,0.047267,10.250486,32.78213,0.001925,0.156664,0.170399,1.082196,0.805824,463.997116
min,4.6,0.12,0.0,0.9,0.012,1.0,6.0,0.99007,2.74,0.33,8.4,3.0,0.0
25%,7.1,0.3925,0.09,1.9,0.07,7.0,21.0,0.99557,3.205,0.55,9.5,5.0,411.0
50%,7.9,0.52,0.25,2.2,0.079,13.0,37.0,0.99668,3.31,0.62,10.2,6.0,794.0
75%,9.1,0.64,0.42,2.6,0.09,21.0,61.0,0.997845,3.4,0.73,11.1,6.0,1209.5
max,15.9,1.58,1.0,15.5,0.611,68.0,289.0,1.00369,4.01,2.0,14.9,8.0,1597.0


#### Accessing Data

Access a column:

In [None]:
df_wine["alcohol"]

0        9.4
1        9.8
2        9.8
3        9.8
4        9.4
        ... 
1138    11.0
1139     9.5
1140    10.5
1141    11.2
1142    10.2
Name: alcohol, Length: 1143, dtype: float64

#### Accessing Data with .loc[] and .iloc[] in Pandas

##### .loc[]: Label-Based Indexing

The `.loc[]` indexer selects rows and columns by label: index labels for rows and column names for columns. With the default `RangeIndex` the row labels happen to be the integers 0, 1, 2, and so on. Slices with `.loc[]` include both the start and the stop label.

```python
df.loc[row_label, column_label]
```

In [67]:

df_wine.head()

,fixed acidity,volatile acidity,citric acid,residual sugar,chlorides,free sulfur dioxide,total sulfur dioxide,density,pH,sulphates,alcohol,quality,Id
0,7.4,0.7,0.0,1.9,0.076,11.0,34.0,0.9978,3.51,0.56,9.4,5,0
1,7.8,0.88,0.0,2.6,0.098,25.0,67.0,0.9968,3.2,0.68,9.8,5,1
2,7.8,0.76,0.04,2.3,0.092,15.0,54.0,0.997,3.26,0.65,9.8,5,2
3,11.2,0.28,0.56,1.9,0.075,17.0,60.0,0.998,3.16,0.58,9.8,6,3
4,7.4,0.7,0.0,1.9,0.076,11.0,34.0,0.9978,3.51,0.56,9.4,5,4


In [68]:
# single row, selected by its index label
df_wine.loc[2]

fixed acidity            7.800
volatile acidity         0.760
citric acid              0.040
residual sugar           2.300
chlorides                0.092
free sulfur dioxide     15.000
total sulfur dioxide    54.000
density                  0.997
pH                       3.260
sulphates                0.650
alcohol                  9.800
quality                  5.000
Id                       2.000
Name: 2, dtype: float64

In [69]:
# multiple rows [start:stop]; with .loc the stop label is included
df_wine.loc[2:3]

,fixed acidity,volatile acidity,citric acid,residual sugar,chlorides,free sulfur dioxide,total sulfur dioxide,density,pH,sulphates,alcohol,quality,Id
2,7.8,0.76,0.04,2.3,0.092,15.0,54.0,0.997,3.26,0.65,9.8,5,2
3,11.2,0.28,0.56,1.9,0.075,17.0,60.0,0.998,3.16,0.58,9.8,6,3


In [70]:
# multiple rows with specific columns
df_wine.loc[2:3,["alcohol","pH"]]

,alcohol,pH
2,9.8,3.26
3,9.8,3.16
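
Label-based indexing is easier to see when the index is not made of integers. The sketch below uses a hand-made frame with string labels (the frame `people` and its labels are assumptions for illustration) and shows the main forms `.loc[]` accepts: a single label, a label slice, a list of labels, and a boolean mask.

```python
import pandas as pd

people = pd.DataFrame(
    {"Age": [25, 30, 35], "City": ["New York", "Los Angeles", "Chicago"]},
    index=["alice", "bob", "charlie"],
)

age_bob = people.loc["bob", "Age"]                  # one row label, one column
first_two = people.loc["alice":"bob"]               # label slice, "bob" included
cities = people.loc[["alice", "charlie"], "City"]   # list of row labels
over_28 = people.loc[people["Age"] > 28, ["City"]]  # boolean mask plus column list

print(age_bob)
print(first_two)
print(cities.tolist())
print(over_28)
```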


##### .iloc[]: Integer-Position Based Indexing

The `.iloc[]` indexer selects rows and columns by integer position, counting from 0, regardless of what the index labels are. Unlike `.loc[]`, slices with `.iloc[]` exclude the stop position, as in ordinary Python slicing.

Basic syntax for `.iloc[]`:

```python
df.iloc[row_index, column_index]
```

- `row_index`: the integer position of the row (starting from 0).
- `column_index`: the integer position of the column (starting from 0).

In [None]:
df_wine.iloc[0,[7,9]]

density      0.9978
sulphates    0.5600
Name: 0, dtype: float64
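
The difference between position and label only shows up when the two disagree. The sketch below gives a small hand-made frame an index of 10, 20, 30, 40 (an assumption chosen so labels and positions differ) and compares `.iloc[]` with `.loc[]` on it.

```python
import pandas as pd

wine = pd.DataFrame(
    {"density": [0.9978, 0.9968, 0.9970, 0.9980], "pH": [3.51, 3.20, 3.26, 3.16]},
    index=[10, 20, 30, 40],  # labels deliberately differ from positions
)

by_position = wine.iloc[1:3]  # positions 1 and 2; stop position excluded
by_label = wine.loc[20:30]    # labels 20 and 30; stop label included
last_row = wine.iloc[-1]      # negative positions count from the end
cell = wine.iloc[0, 1]        # first row, second column

print(by_position.index.tolist())
print(by_label.index.tolist())
print(last_row["pH"], cell)
```

Both slices return the same two rows here, but for different reasons: `.iloc[1:3]` stops before position 3, while `.loc[20:30]` runs through label 30.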

In [None]:
data = {
    "Name": ["Alice", "Bob", "Charlie"],
    "Age": [25, 30, 35],
    "Salary": [50000, 60000, 70000]
}

df = pd.DataFrame(data)
print(df)

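In [None]:
# Assign the result back rather than calling fillna(inplace=True) on a selected
# column, which newer pandas versions warn about and may not apply to df.
df["Salary"] = df["Salary"].fillna(df["Salary"].mean())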
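
The sample data above has no missing values, so filling has no visible effect there. The sketch below repeats the idea on a frame with one deliberately missing salary. The frame `staff` and the `NaN` entry are made up for illustration.

```python
import numpy as np
import pandas as pd

staff = pd.DataFrame({
    "Name": ["Alice", "Bob", "Charlie"],
    "Salary": [50000.0, np.nan, 70000.0],  # Bob's salary is missing
})

# Count the missing values before filling.
missing_before = int(staff["Salary"].isna().sum())

# mean() skips NaN by default, so the fill value is (50000 + 70000) / 2.
staff["Salary"] = staff["Salary"].fillna(staff["Salary"].mean())

missing_after = int(staff["Salary"].isna().sum())

print(missing_before, missing_after)
print(staff)
```

Filling with the mean is only one option; whether it is appropriate depends on the data, and alternatives such as the median or dropping the rows with `dropna()` may suit other cases better.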

## Data Visualization: Telling Stories with Data