# Session 17 Pandas DataFrame

In [1]:
import pandas as pd
import numpy as np

## 📘 DataFrame (in Pandas)

A **DataFrame** is a **two-dimensional (2D)**, **labeled data structure** provided by the **`pandas`** library in Python. It is one of the most powerful and essential tools in data science for working with **tabular data**, similar to how data is stored in **Excel spreadsheets** or **SQL tables**.

### ✅ Key Characteristics:

* Consists of **rows and columns**
* Each **column** can contain different **data types** (integers, floats, strings, etc.)
* Both **rows (index)** and **columns (labels)** are **labeled**, making it easy to select and manipulate data
* Ideal for data cleaning, analysis, and manipulation tasks

---

## 📐 2D Structure of DataFrame

A DataFrame is a **2D object**, meaning:

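* **Rows** are laid out along `axis=0` (the index); an operation with `axis=0` runs down the rows and yields one result per column
* **Columns** are laid out along `axis=1`; an operation with `axis=1` runs across the columns and yields one result per row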

### Example Table Representation:

```
DataFrame
  ┌────────────┬─────┬────────────┐
  │ Name       │ Age │ City       │  → Columns
  ├────────────┼─────┼────────────┤
0 │ Alice      │ 25  │ Delhi      │
1 │ Bob        │ 30  │ Mumbai     │
2 │ Charlie    │ 22  │ Kolkata    │
  └────────────┴─────┴────────────┘
↑
Index (Rows)
```

---

## 📊 Role of Series in a DataFrame

The **`Series`** is the **building block** of a DataFrame.

* A **Series** is a **1D labeled array**.
* In a DataFrame:

  * Each **column** is a **Series** with the column name as the label and row indices as the index.
  * Each **row** is also treated as a **Series**, where the column labels serve as the index.

### Conceptual View:

* `DataFrame = Collection of Series objects (columns)`
* `Row = Series where index = column names`
* `Column = Series where index = row numbers`

---

## 💻 Example:

```python
import pandas as pd

# Create a DataFrame from a dictionary
data = {
    'Name': ['Alice', 'Bob', 'Charlie'],
    'Age': [25, 30, 22],
    'City': ['Delhi', 'Mumbai', 'Kolkata']
}

df = pd.DataFrame(data)

# Display the DataFrame
print(df)

# Access a column (returns a Series)
print(type(df['Age']))      # Output: <class 'pandas.core.series.Series'>

# Access a row (also returns a Series)
print(type(df.loc[0]))      # Output: <class 'pandas.core.series.Series'>
```

---

### 📌 Summary Table

| Feature          | Description                                   |
| ---------------- | --------------------------------------------- |
| **2D Structure** | Rows (axis=0) and Columns (axis=1)            |
| **Column**       | A `pandas.Series` object                      |
| **Row**          | Also behaves as a `pandas.Series`             |
| **Use Case**     | Essential in data analysis, cleaning, ML prep |

## Methods to read/create a DataFrame
A DataFrame can be created from in-memory Python objects (lists, dictionaries) or read from a file. The common approaches are shown below.

In [2]:
# creating df using a list
stud_list = [
    [100, 80, 10],
    [90, 70, 7],
    [120, 100, 14],
    [80, 5, 2],
    [100, 80, 10]
]

stud = pd.DataFrame(stud_list, columns=['iq', 'marks', 'pkg'])

In [3]:
# using dictionaries
stud_dict = {
    'name': ['nool', 'wool', 'rupaid', 'shabh', 'edit', 'amnit'],
    'iq': [100, 90, 120, 80, 0, 0],
    'marks' : [80, 70, 100, 50, 0, 0],
    'pkg': [10, 7, 14, 2, 0, 0]
}

students = pd.DataFrame(stud_dict)
students

Unnamed: 0,name,iq,marks,pkg
0,nool,100,80,10
1,wool,90,70,7
2,rupaid,120,100,14
3,shabh,80,50,2
4,edit,0,0,0
5,amnit,0,0,0


In [2]:
# using read_csv
movies = pd.read_csv('movies.csv')
movies

Unnamed: 0,title_x,imdb_id,poster_path,wiki_link,title_y,original_title,is_adult,year_of_release,runtime,genres,imdb_rating,imdb_votes,story,summary,tagline,actors,wins_nominations,release_date
0,Uri: The Surgical Strike,tt8291224,https://upload.wikimedia.org/wikipedia/en/thum...,https://en.wikipedia.org/wiki/Uri:_The_Surgica...,Uri: The Surgical Strike,Uri: The Surgical Strike,0,2019,138,Action|Drama|War,8.4,35112,Divided over five chapters the film chronicle...,Indian army special forces execute a covert op...,,Vicky Kaushal|Paresh Rawal|Mohit Raina|Yami Ga...,4 wins,11 January 2019 (USA)
1,Battalion 609,tt9472208,,https://en.wikipedia.org/wiki/Battalion_609,Battalion 609,Battalion 609,0,2019,131,War,4.1,73,The story revolves around a cricket match betw...,The story of Battalion 609 revolves around a c...,,Vicky Ahuja|Shoaib Ibrahim|Shrikant Kamat|Elen...,,11 January 2019 (India)
2,The Accidental Prime Minister (film),tt6986710,https://upload.wikimedia.org/wikipedia/en/thum...,https://en.wikipedia.org/wiki/The_Accidental_P...,The Accidental Prime Minister,The Accidental Prime Minister,0,2019,112,Biography|Drama,6.1,5549,Based on the memoir by Indian policy analyst S...,Explores Manmohan Singh's tenure as the Prime ...,,Anupam Kher|Akshaye Khanna|Aahana Kumra|Atul S...,,11 January 2019 (USA)
3,Why Cheat India,tt8108208,https://upload.wikimedia.org/wikipedia/en/thum...,https://en.wikipedia.org/wiki/Why_Cheat_India,Why Cheat India,Why Cheat India,0,2019,121,Crime|Drama,6.0,1891,The movie focuses on existing malpractices in ...,The movie focuses on existing malpractices in ...,,Emraan Hashmi|Shreya Dhanwanthary|Snighdadeep ...,,18 January 2019 (USA)
4,Evening Shadows,tt6028796,,https://en.wikipedia.org/wiki/Evening_Shadows,Evening Shadows,Evening Shadows,0,2018,102,Drama,7.3,280,While gay rights and marriage equality has bee...,Under the 'Evening Shadows' truth often plays...,,Mona Ambegaonkar|Ananth Narayan Mahadevan|Deva...,17 wins & 1 nomination,11 January 2019 (India)
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
1624,Tera Mera Saath Rahen,tt0301250,https://upload.wikimedia.org/wikipedia/en/2/2b...,https://en.wikipedia.org/wiki/Tera_Mera_Saath_...,Tera Mera Saath Rahen,Tera Mera Saath Rahen,0,2001,148,Drama,4.9,278,Raj Dixit lives with his younger brother Rahu...,A man is torn between his handicapped brother ...,,Ajay Devgn|Sonali Bendre|Namrata Shirodkar|Pre...,,7 November 2001 (India)
1625,Yeh Zindagi Ka Safar,tt0298607,https://upload.wikimedia.org/wikipedia/en/thum...,https://en.wikipedia.org/wiki/Yeh_Zindagi_Ka_S...,Yeh Zindagi Ka Safar,Yeh Zindagi Ka Safar,0,2001,146,Drama,3.0,133,Hindi pop-star Sarina Devan lives a wealthy ...,A singer finds out she was adopted when the ed...,,Ameesha Patel|Jimmy Sheirgill|Nafisa Ali|Gulsh...,,16 November 2001 (India)
1626,Sabse Bada Sukh,tt0069204,,https://en.wikipedia.org/wiki/Sabse_Bada_Sukh,Sabse Bada Sukh,Sabse Bada Sukh,0,2018,\N,Comedy|Drama,6.1,13,Village born Lalloo re-locates to Bombay and ...,Village born Lalloo re-locates to Bombay and ...,,Vijay Arora|Asrani|Rajni Bala|Kumud Damle|Utpa...,,
1627,Daaka,tt10833860,https://upload.wikimedia.org/wikipedia/en/thum...,https://en.wikipedia.org/wiki/Daaka,Daaka,Daaka,0,2019,136,Action,7.4,38,Shinda tries robbing a bank so he can be wealt...,Shinda tries robbing a bank so he can be wealt...,,Gippy Grewal|Zareen Khan|,,1 November 2019 (USA)


In [3]:
ipl = pd.read_csv('ipl-matches.csv')
ipl

Unnamed: 0,ID,City,Date,Season,MatchNumber,Team1,Team2,Venue,TossWinner,TossDecision,SuperOver,WinningTeam,WonBy,Margin,method,Player_of_Match,Team1Players,Team2Players,Umpire1,Umpire2
0,1312200,Ahmedabad,2022-05-29,2022,Final,Rajasthan Royals,Gujarat Titans,"Narendra Modi Stadium, Ahmedabad",Rajasthan Royals,bat,N,Gujarat Titans,Wickets,7.0,,HH Pandya,"['YBK Jaiswal', 'JC Buttler', 'SV Samson', 'D ...","['WP Saha', 'Shubman Gill', 'MS Wade', 'HH Pan...",CB Gaffaney,Nitin Menon
1,1312199,Ahmedabad,2022-05-27,2022,Qualifier 2,Royal Challengers Bangalore,Rajasthan Royals,"Narendra Modi Stadium, Ahmedabad",Rajasthan Royals,field,N,Rajasthan Royals,Wickets,7.0,,JC Buttler,"['V Kohli', 'F du Plessis', 'RM Patidar', 'GJ ...","['YBK Jaiswal', 'JC Buttler', 'SV Samson', 'D ...",CB Gaffaney,Nitin Menon
2,1312198,Kolkata,2022-05-25,2022,Eliminator,Royal Challengers Bangalore,Lucknow Super Giants,"Eden Gardens, Kolkata",Lucknow Super Giants,field,N,Royal Challengers Bangalore,Runs,14.0,,RM Patidar,"['V Kohli', 'F du Plessis', 'RM Patidar', 'GJ ...","['Q de Kock', 'KL Rahul', 'M Vohra', 'DJ Hooda...",J Madanagopal,MA Gough
3,1312197,Kolkata,2022-05-24,2022,Qualifier 1,Rajasthan Royals,Gujarat Titans,"Eden Gardens, Kolkata",Gujarat Titans,field,N,Gujarat Titans,Wickets,7.0,,DA Miller,"['YBK Jaiswal', 'JC Buttler', 'SV Samson', 'D ...","['WP Saha', 'Shubman Gill', 'MS Wade', 'HH Pan...",BNJ Oxenford,VK Sharma
4,1304116,Mumbai,2022-05-22,2022,70,Sunrisers Hyderabad,Punjab Kings,"Wankhede Stadium, Mumbai",Sunrisers Hyderabad,bat,N,Punjab Kings,Wickets,5.0,,Harpreet Brar,"['PK Garg', 'Abhishek Sharma', 'RA Tripathi', ...","['JM Bairstow', 'S Dhawan', 'M Shahrukh Khan',...",AK Chaudhary,NA Patwardhan
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
945,335986,Kolkata,2008-04-20,2007/08,4,Kolkata Knight Riders,Deccan Chargers,Eden Gardens,Deccan Chargers,bat,N,Kolkata Knight Riders,Wickets,5.0,,DJ Hussey,"['WP Saha', 'BB McCullum', 'RT Ponting', 'SC G...","['AC Gilchrist', 'Y Venugopal Rao', 'VVS Laxma...",BF Bowden,K Hariharan
946,335985,Mumbai,2008-04-20,2007/08,5,Mumbai Indians,Royal Challengers Bangalore,Wankhede Stadium,Mumbai Indians,bat,N,Royal Challengers Bangalore,Wickets,5.0,,MV Boucher,"['L Ronchi', 'ST Jayasuriya', 'DJ Thornely', '...","['S Chanderpaul', 'R Dravid', 'LRPL Taylor', '...",SJ Davis,DJ Harper
947,335984,Delhi,2008-04-19,2007/08,3,Delhi Daredevils,Rajasthan Royals,Feroz Shah Kotla,Rajasthan Royals,bat,N,Delhi Daredevils,Wickets,9.0,,MF Maharoof,"['G Gambhir', 'V Sehwag', 'S Dhawan', 'MK Tiwa...","['T Kohli', 'YK Pathan', 'SR Watson', 'M Kaif'...",Aleem Dar,GA Pratapkumar
948,335983,Chandigarh,2008-04-19,2007/08,2,Kings XI Punjab,Chennai Super Kings,"Punjab Cricket Association Stadium, Mohali",Chennai Super Kings,bat,N,Chennai Super Kings,Runs,33.0,,MEK Hussey,"['K Goel', 'JR Hopes', 'KC Sangakkara', 'Yuvra...","['PA Patel', 'ML Hayden', 'MEK Hussey', 'MS Dh...",MR Benson,SL Shastri
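
Since `movies.csv` and `ipl-matches.csv` may not be at hand, here is a minimal sketch of `read_csv` on an in-memory CSV string instead of a file. The column names and values are made up for illustration and are not taken from the datasets above.

```python
import io
import pandas as pd

# Hypothetical CSV text standing in for a file on disk
csv_text = """title,year,rating
Movie A,2019,8.4
Movie B,2018,7.3
"""

# read_csv accepts a file path or any file-like object
demo = pd.read_csv(io.StringIO(csv_text))
print(demo.shape)    # (rows, columns)
print(demo.dtypes)   # dtype inferred for each column
```

The same call works with a path such as `'movies.csv'`, provided the file exists in the working directory.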


## DataFrame Attributes and Methods
Next, we will discuss some of the **attributes** and **methods** that are commonly used with **DataFrames**:

### Shape
Provides the shape of the DataFrame as a tuple `(rows, columns)`. Since a DataFrame is always two-dimensional, the tuple always has exactly two entries.

In [6]:
print(movies.shape)
print(ipl.shape)

(1629, 18)
(950, 20)
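
Because `shape` is a plain tuple, it can be unpacked directly. The sketch below uses a small made-up frame (not one of the datasets above) and also shows two related ways to count rows and cells.

```python
import pandas as pd

# Small hypothetical frame: 3 rows, 2 columns
df = pd.DataFrame({'iq': [100, 90, 120], 'marks': [80, 70, 100]})

n_rows, n_cols = df.shape   # shape is a plain tuple
print(n_rows, n_cols)
print(len(df))              # number of rows only
print(df.size)              # total number of cells (rows * columns)
```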


### Dtypes
Provides the **data type** of each column of the DataFrame. Since a DataFrame comprises *multiple Series*, each Series (column) is labelled with its own data type. These give us insight into what kind of values each column holds and which operations apply to it.

A Series holding strings or *heterogeneous* types of data is marked as **object**.

In [7]:
movies.dtypes

title_x              object
imdb_id              object
poster_path          object
wiki_link            object
title_y              object
original_title       object
is_adult              int64
year_of_release       int64
runtime              object
genres               object
imdb_rating         float64
imdb_votes            int64
story                object
summary              object
tagline              object
actors               object
wins_nominations     object
release_date         object
dtype: object
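
In the output above, `runtime` is `object` even though it looks numeric; the sample rows shown earlier contain a `\N` placeholder in that column, which would explain it. A minimal sketch of converting such a column with `pd.to_numeric`, using made-up values rather than the real dataset:

```python
import pandas as pd

# Hypothetical column mixing numeric strings with a '\N' placeholder
runtime = pd.Series(['138', '131', '\\N', '112'])
print(runtime.dtype)        # object (or a string dtype on newer pandas)

# errors='coerce' turns unparseable entries into NaN
runtime_num = pd.to_numeric(runtime, errors='coerce')
print(runtime_num.dtype)    # float64, because NaN is a float
print(runtime_num.isnull().sum())
```

Whether coercing to `NaN` is appropriate depends on the analysis; it is only one possible way to handle the placeholder.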

### Index
Provides the **index** of the DataFrame, i.e. the labels of its rows. By default, rows are labelled 0 to *n* - 1.

In [8]:
ipl.index

RangeIndex(start=0, stop=950, step=1)

The `RangeIndex` in the above output indicates that the index was generated automatically: it starts at 0, stops before 950 (so the last label is 949), and increases by 1 at each step.
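
As a small illustration, the sketch below shows the default index on a made-up frame and then replaces it with custom labels. The labels `'a'`, `'b'`, `'c'` are hypothetical.

```python
import pandas as pd

df = pd.DataFrame({'marks': [80, 70, 100]})
print(df.index)         # default RangeIndex starting at 0
print(list(df.index))   # the labels as a plain list

# Work on a copy and assign hypothetical row labels
named = df.copy()
named.index = ['a', 'b', 'c']
print(named.loc['b', 'marks'])   # look up a value by row label
```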

### Columns
Provides the names of all the columns of the DataFrame.

In [9]:
movies.columns

Index(['title_x', 'imdb_id', 'poster_path', 'wiki_link', 'title_y',
       'original_title', 'is_adult', 'year_of_release', 'runtime', 'genres',
       'imdb_rating', 'imdb_votes', 'story', 'summary', 'tagline', 'actors',
       'wins_nominations', 'release_date'],
      dtype='object')

### Values

This one is covered in a little more depth, because I felt it needs a more detailed explanation.

The `.values` attribute of a **Pandas DataFrame** returns the **underlying data** as a **NumPy array** (or a similar array-like structure).

---

#### ✅ Syntax:

```python
df.values
```

---

#### 📌 What it returns:

* A **NumPy array** containing the **actual data** of the DataFrame **without any row or column labels**.
* The data types in the array will be upcast if necessary (e.g., mixed types → object dtype).

---

#### 💻 Example:

```python
import pandas as pd

data = {
    'Name': ['Alice', 'Bob', 'Charlie'],
    'Age': [25, 30, 22],
    'Score': [88.5, 92.0, 79.5]
}

df = pd.DataFrame(data)
print(df)
```

**Output:**

```
      Name  Age  Score
0    Alice   25   88.5
1      Bob   30   92.0
2  Charlie   22   79.5
```

```python
print(repr(df.values))  # repr shows the array(...) wrapper and the dtype
```

**Output:**

```python
array([['Alice', 25, 88.5],
       ['Bob', 30, 92.0],
       ['Charlie', 22, 79.5]], dtype=object)
```

---

#### 🧾 Notes:

* If all columns are of **numeric type**, the returned array will be of **numeric dtype** (e.g., `float64` or `int64`).
* If there are **mixed types** (like strings and numbers), the resulting array has `dtype=object`.
* `.values` **does not include** index or column labels; it is **pure data** only.

---

#### 🧪 Use Cases in Data Science:

* Useful for converting a DataFrame into a **NumPy array** for:

  * Mathematical operations
  * Feeding into machine learning models (e.g., Scikit-learn)
  * Custom numerical processing

---

#### ⚠️ Caution:

`.values` is not always preferred in modern pandas versions due to better alternatives:

* `df.to_numpy()` is recommended instead, because it handles corner cases more robustly.

```python
df.to_numpy()  # Safer, modern alternative
```
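
To make the dtype notes concrete, here is a short sketch that reuses the example frame from this section. It compares the array obtained from all columns with the one obtained from the numeric columns only.

```python
import pandas as pd

df = pd.DataFrame({
    'Name': ['Alice', 'Bob', 'Charlie'],
    'Age': [25, 30, 22],
    'Score': [88.5, 92.0, 79.5]
})

# Strings and numbers together give an object array
print(df.to_numpy().dtype)

# Numeric columns only: int and float are upcast to a common float64
numeric = df[['Age', 'Score']].to_numpy()
print(numeric.dtype)
print(numeric.shape)
```

Selecting the numeric columns first is usually what you want before passing the data to numerical code.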

---



### Head and Tail
`head()` and `tail()` are demonstrated in the previous notes. They return the first few and the last few rows of the DataFrame respectively (5 by default).

In [10]:
# head() demonstration
movies.head()

Unnamed: 0,title_x,imdb_id,poster_path,wiki_link,title_y,original_title,is_adult,year_of_release,runtime,genres,imdb_rating,imdb_votes,story,summary,tagline,actors,wins_nominations,release_date
0,Uri: The Surgical Strike,tt8291224,https://upload.wikimedia.org/wikipedia/en/thum...,https://en.wikipedia.org/wiki/Uri:_The_Surgica...,Uri: The Surgical Strike,Uri: The Surgical Strike,0,2019,138,Action|Drama|War,8.4,35112,Divided over five chapters the film chronicle...,Indian army special forces execute a covert op...,,Vicky Kaushal|Paresh Rawal|Mohit Raina|Yami Ga...,4 wins,11 January 2019 (USA)
1,Battalion 609,tt9472208,,https://en.wikipedia.org/wiki/Battalion_609,Battalion 609,Battalion 609,0,2019,131,War,4.1,73,The story revolves around a cricket match betw...,The story of Battalion 609 revolves around a c...,,Vicky Ahuja|Shoaib Ibrahim|Shrikant Kamat|Elen...,,11 January 2019 (India)
2,The Accidental Prime Minister (film),tt6986710,https://upload.wikimedia.org/wikipedia/en/thum...,https://en.wikipedia.org/wiki/The_Accidental_P...,The Accidental Prime Minister,The Accidental Prime Minister,0,2019,112,Biography|Drama,6.1,5549,Based on the memoir by Indian policy analyst S...,Explores Manmohan Singh's tenure as the Prime ...,,Anupam Kher|Akshaye Khanna|Aahana Kumra|Atul S...,,11 January 2019 (USA)
3,Why Cheat India,tt8108208,https://upload.wikimedia.org/wikipedia/en/thum...,https://en.wikipedia.org/wiki/Why_Cheat_India,Why Cheat India,Why Cheat India,0,2019,121,Crime|Drama,6.0,1891,The movie focuses on existing malpractices in ...,The movie focuses on existing malpractices in ...,,Emraan Hashmi|Shreya Dhanwanthary|Snighdadeep ...,,18 January 2019 (USA)
4,Evening Shadows,tt6028796,,https://en.wikipedia.org/wiki/Evening_Shadows,Evening Shadows,Evening Shadows,0,2018,102,Drama,7.3,280,While gay rights and marriage equality has bee...,Under the 'Evening Shadows' truth often plays...,,Mona Ambegaonkar|Ananth Narayan Mahadevan|Deva...,17 wins & 1 nomination,11 January 2019 (India)


In [11]:
# tail demonstration
ipl.tail()

Unnamed: 0,ID,City,Date,Season,MatchNumber,Team1,Team2,Venue,TossWinner,TossDecision,SuperOver,WinningTeam,WonBy,Margin,method,Player_of_Match,Team1Players,Team2Players,Umpire1,Umpire2
945,335986,Kolkata,2008-04-20,2007/08,4,Kolkata Knight Riders,Deccan Chargers,Eden Gardens,Deccan Chargers,bat,N,Kolkata Knight Riders,Wickets,5.0,,DJ Hussey,"['WP Saha', 'BB McCullum', 'RT Ponting', 'SC G...","['AC Gilchrist', 'Y Venugopal Rao', 'VVS Laxma...",BF Bowden,K Hariharan
946,335985,Mumbai,2008-04-20,2007/08,5,Mumbai Indians,Royal Challengers Bangalore,Wankhede Stadium,Mumbai Indians,bat,N,Royal Challengers Bangalore,Wickets,5.0,,MV Boucher,"['L Ronchi', 'ST Jayasuriya', 'DJ Thornely', '...","['S Chanderpaul', 'R Dravid', 'LRPL Taylor', '...",SJ Davis,DJ Harper
947,335984,Delhi,2008-04-19,2007/08,3,Delhi Daredevils,Rajasthan Royals,Feroz Shah Kotla,Rajasthan Royals,bat,N,Delhi Daredevils,Wickets,9.0,,MF Maharoof,"['G Gambhir', 'V Sehwag', 'S Dhawan', 'MK Tiwa...","['T Kohli', 'YK Pathan', 'SR Watson', 'M Kaif'...",Aleem Dar,GA Pratapkumar
948,335983,Chandigarh,2008-04-19,2007/08,2,Kings XI Punjab,Chennai Super Kings,"Punjab Cricket Association Stadium, Mohali",Chennai Super Kings,bat,N,Chennai Super Kings,Runs,33.0,,MEK Hussey,"['K Goel', 'JR Hopes', 'KC Sangakkara', 'Yuvra...","['PA Patel', 'ML Hayden', 'MEK Hussey', 'MS Dh...",MR Benson,SL Shastri
949,335982,Bangalore,2008-04-18,2007/08,1,Royal Challengers Bangalore,Kolkata Knight Riders,M Chinnaswamy Stadium,Royal Challengers Bangalore,field,N,Kolkata Knight Riders,Runs,140.0,,BB McCullum,"['R Dravid', 'W Jaffer', 'V Kohli', 'JH Kallis...","['SC Ganguly', 'BB McCullum', 'RT Ponting', 'D...",Asad Rauf,RE Koertzen
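
Both methods accept the number of rows to return. A minimal sketch on a made-up ten-row frame:

```python
import pandas as pd

df = pd.DataFrame({'n': range(10)})

print(df.head(3))       # first 3 rows
print(df.tail(2))       # last 2 rows
print(len(df.head()))   # with no argument, 5 rows are returned
```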


### Sample

The `sample()` function in pandas is used to **randomly select rows or columns** from a DataFrame. It is extremely useful in data science for tasks such as:

* Creating a **train-test split**
* Doing **random sampling** for exploratory data analysis (EDA)
* Testing pipelines with smaller data
* Bootstrapping methods

---

#### ✅ Syntax:

```python
df.sample(n=None, frac=None, replace=False, weights=None, random_state=None, axis=None)
```

---

#### 📌 Parameters:

| Parameter      | Description                                                     |
| -------------- | --------------------------------------------------------------- |
| `n`            | Number of items to return (rows or columns depending on `axis`) |
| `frac`         | Fraction of items to return (e.g. 0.5 for 50%)                  |
| `replace`      | If `True`, sample with replacement                              |
| `weights`      | Probability weights for sampling (can be a column or list)      |
| `random_state` | For reproducibility; sets the seed                              |
| `axis`         | 0 for rows (default), 1 for columns                             |

---

#### 🧪 Example 1: Random Row Sampling

```python
import pandas as pd

df = pd.DataFrame({
    'Name': ['Alice', 'Bob', 'Charlie', 'David', 'Eva'],
    'Score': [85, 92, 88, 76, 95]
})

# Sample 2 random rows
df.sample(n=2)
```

---

#### 🧪 Example 2: Sample 40% of the Data

```python
df.sample(frac=0.4)
```

---

#### 🧪 Example 3: Sampling with Replacement

The `replace` parameter allows the **repetition** of chosen rows/columns. When it is `False` (the default), each row/column can appear in the sample at most once.

```python
df.sample(n=3, replace=True)
```

---

#### 🧪 Example 4: Sampling with a Fixed Random Seed (Reproducibility)

```python
df.sample(n=2, random_state=42)
```

This will always return the same rows on every run (good for experiments).

---

#### 🧪 Example 5: Column Sampling

```python
df.sample(n=1, axis=1)
```

Returns a random **column** instead of row.

---

#### 🎯 Use Cases in Data Science:

| Use Case         | Description                                       |
| ---------------- | ------------------------------------------------- |
| Train-Test Split | Randomly sample rows for training/testing         |
| Debugging        | Work on a small subset of the data                |
| Bootstrapping    | Create samples with replacement                   |
| Visual EDA       | Sample for quick plotting or insights             |
| Bias Detection   | Analyze if models are sensitive to random subsets |

Provided below are some demonstrations on the datasets that we previously imported.

In [12]:
# selecting 2 random samples
movies.sample(n=2)

Unnamed: 0,title_x,imdb_id,poster_path,wiki_link,title_y,original_title,is_adult,year_of_release,runtime,genres,imdb_rating,imdb_votes,story,summary,tagline,actors,wins_nominations,release_date
529,Khoobsurat (2014 film),tt3554418,https://upload.wikimedia.org/wikipedia/en/thum...,https://en.wikipedia.org/wiki/Khoobsurat_(2014...,Khoobsurat,Khoobsurat,0,2014,130,Comedy|Romance,6.4,6981,Khoobsurat is a quirky modern romantic comedy...,A hopelessly romantic physiotherapist meets a ...,,Sonam Kapoor|Fawad Khan|Ratna Pathak Shah|Kiro...,2 wins & 2 nominations,19 September 2014 (USA)
923,Kisse Pyaar Karoon,tt0438894,,https://en.wikipedia.org/wiki/Kisse_Pyaar_Karoon,Kisse Pyaar Karoon?,Kisse Pyaar Karoon?,0,2009,123,Action|Comedy|Crime,3.3,120,Mumbai-based collegian-slackers Sid(Arshad War...,Two men abduct a woman who wants to alienate t...,,Arshad Warsi|Aashish Chaudhary|Yash Tonk|Udita...,,27 February 2009 (India)


In [13]:
# random sampling 10% of the data
movies.sample(frac=0.1)

Unnamed: 0,title_x,imdb_id,poster_path,wiki_link,title_y,original_title,is_adult,year_of_release,runtime,genres,imdb_rating,imdb_votes,story,summary,tagline,actors,wins_nominations,release_date
1004,Haal-e-Dil,tt1252488,https://upload.wikimedia.org/wikipedia/en/thum...,https://en.wikipedia.org/wiki/Haal-e-Dil,Haal-e-Dil,Haal-e-Dil,0,2008,124,Drama|Romance,3.4,224,While traveling by train Shekhar attempts to w...,While traveling by train Shekhar attempts to w...,,Amita Pathak|Nakuul Mehta|Adhyayan Suman|,,20 June 2008 (India)
113,Phamous,tt8338746,https://upload.wikimedia.org/wikipedia/en/thum...,https://en.wikipedia.org/wiki/Phamous,Phamous,Phamous,0,2018,115,Comedy|Crime|Drama,3.7,207,Set in the wild wild east the story of Phamou...,Set in the wild wild east the story of Phamou...,,Jimmy Sheirgill|Shriya Saran|Kay Kay Menon|Pan...,,1 June 2018 (India)
1614,Moksha (2001 film),tt0301240,https://upload.wikimedia.org/wikipedia/en/6/6e...,https://en.wikipedia.org/wiki/Moksha_(2001_film),Moksha: Salvation,Moksha: Salvation,0,2001,150,Drama|Thriller,6.3,356,Appalled at the manner lawyers treat less affl...,An idealistic lawyer wants to help the poor ge...,,Arjun Rampal|Manisha Koirala|Kalpana Pandit|Su...,3 wins & 2 nominations,30 November 2001 (India)
1264,Koi Aap Sa,tt0488840,,https://en.wikipedia.org/wiki/Koi_Aap_Sa,Koi Aap Sa: But Lovers Have to Be Friends,Koi Aap Sa: But Lovers Have to Be Friends,0,2005,142,Comedy|Drama|Romance,5.4,194,An emotional Mumbai-based football player Roh...,An emotional Mumbai-based football player Roh...,Friends may not be lovers but lovers have to b...,Aftab Shivdasani|Anita Hassanandani Reddy|Dipa...,,14 October 2005 (India)
799,Force (2011 film),tt1992138,https://upload.wikimedia.org/wikipedia/en/thum...,https://en.wikipedia.org/wiki/Force_(2011_film),Force,Force,0,2011,137,Action|Thriller,6.4,6497,Critically wounded and lying unconscious in a ...,A vengeful drug-dealer/gangster targets and te...,Bound by Duty| Unleashed by Love.,John Abraham|Genelia D'Souza|Raj Babbar|Mohnis...,2 wins & 1 nomination,30 September 2011 (India)
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
952,Fox (film),tt1324076,https://upload.wikimedia.org/wikipedia/en/thum...,https://en.wikipedia.org/wiki/Fox_(film),Fox,Fox,0,2009,145,Crime|Drama|Mystery,5.2,512,Mumbai-based Advocate Arjun Kapoor decides to ...,A disgraced lawyer-turned-author is arrested f...,,Arjun Rampal|Sunny Deol|Udita Goswami|Sagarika...,,4 September 2009 (India)
982,Jodhaa Akbar,tt0449994,https://upload.wikimedia.org/wikipedia/en/thum...,https://en.wikipedia.org/wiki/Jodhaa_Akbar,Jodhaa Akbar,Jodhaa Akbar,0,2008,213,Action|Drama|History,7.6,27541,Jodhaa Akbar is a sixteenth century love story...,A sixteenth century love story about a marriag...,,Hrithik Roshan|Aishwarya Rai Bachchan|Sonu Soo...,32 wins & 21 nominations,15 February 2008 (USA)
1372,Phir Milenge,tt0422950,https://upload.wikimedia.org/wikipedia/en/thum...,https://en.wikipedia.org/wiki/Phir_Milenge,Phir Milenge,Phir Milenge,0,2004,142,Drama,6.1,1590,Tamanna Sahni (Shilpa Shetty) is a dedicated s...,Tamanna Sahni (Shilpa Shetty) is a dedicated s...,,Salman Khan|Abhishek Bachchan|Shilpa Shetty Ku...,3 wins & 5 nominations,13 August 2004 (India)
1297,Sheesha (2005 film),tt0445056,https://upload.wikimedia.org/wikipedia/en/thum...,https://en.wikipedia.org/wiki/Sheesha_(2005_film),Sheesha,Sheesha,0,2005,\N,Drama|Romance|Thriller,3.5,238,Businesswoman Sia Malhotra lives a wealthy lif...,Businesswoman Sia Malhotra lives a wealthy lif...,Some mirrors lie,Neha Dhupia|Sonu Sood|Vivek Shauq|Elidh MacQue...,,11 February 2005 (India)


In [14]:
ipl.sample(n=3, replace=True)

Unnamed: 0,ID,City,Date,Season,MatchNumber,Team1,Team2,Venue,TossWinner,TossDecision,SuperOver,WinningTeam,WonBy,Margin,method,Player_of_Match,Team1Players,Team2Players,Umpire1,Umpire2
831,419110,Chennai,2010-03-14,2009/10,5,Chennai Super Kings,Deccan Chargers,"MA Chidambaram Stadium, Chepauk",Deccan Chargers,bat,N,Deccan Chargers,Runs,31.0,,WPUJC Vaas,"['M Vijay', 'ML Hayden', 'SK Raina', 'S Badrin...","['AC Gilchrist', 'VVS Laxman', 'HH Gibbs', 'A ...",K Hariharan,DJ Harper
688,548319,Chandigarh,2012-04-12,2012,14,Kings XI Punjab,Pune Warriors,"Punjab Cricket Association Stadium, Mohali",Kings XI Punjab,field,N,Kings XI Punjab,Wickets,7.0,,AD Mascarenhas,"['PC Valthaty', 'AC Gilchrist', 'SE Marsh', 'M...","['JD Ryder', 'SC Ganguly', 'MN Samuels', 'RV U...",VA Kulkarni,SK Tarapore
246,1175363,Hyderabad,2019-03-29,2019,8,Rajasthan Royals,Sunrisers Hyderabad,Rajiv Gandhi International Stadium,Rajasthan Royals,bat,N,Sunrisers Hyderabad,Wickets,5.0,,Rashid Khan,"['AM Rahane', 'JC Buttler', 'SV Samson', 'BA S...","['DA Warner', 'JM Bairstow', 'KS Williamson', ...",BNJ Oxenford,C Shamshuddin


In [15]:
movies.sample(n=2, random_state=42)

Unnamed: 0,title_x,imdb_id,poster_path,wiki_link,title_y,original_title,is_adult,year_of_release,runtime,genres,imdb_rating,imdb_votes,story,summary,tagline,actors,wins_nominations,release_date
669,Kahaani,tt1821480,https://upload.wikimedia.org/wikipedia/en/thum...,https://en.wikipedia.org/wiki/Kahaani,Kahaani,Kahaani,0,2012,122,Mystery|Thriller,8.1,53181,Kolkata is abuzz with the preparations for the...,A pregnant woman's search for her missing husb...,A mother of a story,Vidya Balan|Parambrata Chattopadhyay|Indraneil...,19 wins & 18 nominations,9 March 2012 (India)
251,Monsoon Shootout,tt2198235,https://upload.wikimedia.org/wikipedia/en/thum...,https://en.wikipedia.org/wiki/Monsoon_Shootout,Monsoon Shootout,Monsoon Shootout,0,2013,92,Action|Crime|Drama,6.5,801,As the raging monsoon lashes Mumbai the comme...,As heavy rains lash Mumbai a cop on his first...,A rookie cop's moment of reckoning| to shoot o...,Vijay Varma|Nawazuddin Siddiqui|Neeraj Kabi|Ge...,1 win & 7 nominations,15 December 2017 (India)


In [16]:
movies.sample(n=2, axis=1)

Unnamed: 0,title_y,year_of_release
0,Uri: The Surgical Strike,2019
1,Battalion 609,2019
2,The Accidental Prime Minister,2019
3,Why Cheat India,2019
4,Evening Shadows,2018
...,...,...
1624,Tera Mera Saath Rahen,2001
1625,Yeh Zindagi Ka Safar,2001
1626,Sabse Bada Sukh,2018
1627,Daaka,2019
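
The train-test split use case mentioned earlier can be sketched with `sample()` and `drop()`. This is one simple approach on a made-up frame, with an assumed 80/20 split and an arbitrary seed; dedicated helpers in ML libraries offer more options.

```python
import pandas as pd

df = pd.DataFrame({'x': range(10), 'y': [v * 2 for v in range(10)]})

# 80% of the rows for training, with a fixed seed for reproducibility
train = df.sample(frac=0.8, random_state=42)

# The remaining rows (selected by index label) form the test set
test = df.drop(train.index)

print(len(train), len(test))
```

This relies on the index labels being unique, since `drop()` removes rows by label.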


### Info
Provides **high-level information** about the DataFrame. This method prints a summary including the **index dtype**, **columns**, **non-null counts** and **memory usage**.

In [17]:
movies.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 1629 entries, 0 to 1628
Data columns (total 18 columns):
 #   Column            Non-Null Count  Dtype  
---  ------            --------------  -----  
 0   title_x           1629 non-null   object 
 1   imdb_id           1629 non-null   object 
 2   poster_path       1526 non-null   object 
 3   wiki_link         1629 non-null   object 
 4   title_y           1629 non-null   object 
 5   original_title    1629 non-null   object 
 6   is_adult          1629 non-null   int64  
 7   year_of_release   1629 non-null   int64  
 8   runtime           1629 non-null   object 
 9   genres            1629 non-null   object 
 10  imdb_rating       1629 non-null   float64
 11  imdb_votes        1629 non-null   int64  
 12  story             1609 non-null   object 
 13  summary           1629 non-null   object 
 14  tagline           557 non-null    object 
 15  actors            1624 non-null   object 
 16  wins_nominations  707 non-null    object 


### Describe
`describe()` presents a statistical summary of the DataFrame: count, mean, standard deviation, minimum, quartiles, and maximum. By default, it considers only the columns with numeric values.

In [18]:
movies.describe()

Unnamed: 0,is_adult,year_of_release,imdb_rating,imdb_votes
count,1629.0,1629.0,1629.0,1629.0
mean,0.0,2010.263966,5.557459,5384.263352
std,0.0,5.381542,1.567609,14552.103231
min,0.0,2001.0,0.0,0.0
25%,0.0,2005.0,4.4,233.0
50%,0.0,2011.0,5.6,1000.0
75%,0.0,2015.0,6.8,4287.0
max,0.0,2019.0,9.4,310481.0


### Isnull
Constructs a **Boolean** DataFrame of the same shape as the given DataFrame, with `True` where a value is null and `False` where it is not.

In [19]:
movies.isnull()

Unnamed: 0,title_x,imdb_id,poster_path,wiki_link,title_y,original_title,is_adult,year_of_release,runtime,genres,imdb_rating,imdb_votes,story,summary,tagline,actors,wins_nominations,release_date
0,False,False,False,False,False,False,False,False,False,False,False,False,False,False,True,False,False,False
1,False,False,True,False,False,False,False,False,False,False,False,False,False,False,True,False,True,False
2,False,False,False,False,False,False,False,False,False,False,False,False,False,False,True,False,True,False
3,False,False,False,False,False,False,False,False,False,False,False,False,False,False,True,False,True,False
4,False,False,True,False,False,False,False,False,False,False,False,False,False,False,True,False,False,False
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
1624,False,False,False,False,False,False,False,False,False,False,False,False,False,False,True,False,True,False
1625,False,False,False,False,False,False,False,False,False,False,False,False,False,False,True,False,True,False
1626,False,False,True,False,False,False,False,False,False,False,False,False,False,False,True,False,True,True
1627,False,False,False,False,False,False,False,False,False,False,False,False,False,False,True,False,True,False


However, at times, we may need to see which columns have *null* values, and how many. For this, we can sum up the `True` values in each column.

In [20]:
movies.isnull().sum()

title_x                0
imdb_id                0
poster_path          103
wiki_link              0
title_y                0
original_title         0
is_adult               0
year_of_release        0
runtime                0
genres                 0
imdb_rating            0
imdb_votes             0
story                 20
summary                0
tagline             1072
actors                 5
wins_nominations     922
release_date         107
dtype: int64
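
A related figure is the share of missing values per column. The sketch below uses a small made-up frame (the name `demo` and its values are illustrative, not taken from `movies`) and relies on `True` counting as 1, so `mean()` over the Boolean frame gives the fraction of nulls.

```python
import numpy as np
import pandas as pd

# Hypothetical toy data, standing in for a real dataset
demo = pd.DataFrame({
    "title": ["A", "B", "C", "D"],
    "tagline": ["x", np.nan, np.nan, np.nan],
    "rating": [7.1, 6.4, np.nan, 8.0],
})

null_counts = demo.isnull().sum()          # number of nulls per column
null_percent = demo.isnull().mean() * 100  # share of nulls per column, in percent

print(null_counts)
print(null_percent)
```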

### Duplicated
`duplicated()` returns a Boolean Series that marks every row repeating an earlier row as `True`; summing it gives the number of duplicate rows. Exact duplicates usually add no information, so they are typically removed once identified.

In [21]:
movies.duplicated().sum()

np.int64(0)

In [22]:
ipl.duplicated().sum()

np.int64(0)

In [23]:
print(stud.duplicated())
stud.duplicated().sum()

0    False
1    False
2    False
3    False
4     True
dtype: bool


np.int64(1)
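
Once duplicates are identified, `drop_duplicates()` is one way to remove them. The sketch below assumes a toy frame named `stud_demo` holding the same values as the `stud` frame used in this notebook.

```python
import pandas as pd

# Toy frame with the same values as `stud`; row 4 repeats row 0
stud_demo = pd.DataFrame({
    "iq": [100, 90, 120, 80, 100],
    "marks": [80, 70, 100, 5, 80],
    "pkg": [10, 7, 14, 2, 10],
})

# By default the first occurrence is kept and later repeats are flagged
print(stud_demo.duplicated().sum())

# keep=False flags every member of a duplicated group
print(stud_demo.duplicated(keep=False).sum())

# drop_duplicates() returns a new frame; `stud_demo` itself is unchanged
deduped = stud_demo.drop_duplicates()
print(deduped.shape)
```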

### Rename
`rename()` returns a new DataFrame with the given columns renamed. The original DataFrame is left unchanged unless the result is assigned back to a variable.

In [24]:
# 'package' is not a column of stud (the column is 'pkg'), so that entry is silently ignored
stud.rename(columns={'marks': 'percent', 'package': 'lpa'})

Unnamed: 0,iq,percent,pkg
0,100,80,10
1,90,70,7
2,120,100,14
3,80,5,2
4,100,80,10


However, on rechecking the original `stud` variable, we still see the original column names.

In [25]:
stud

Unnamed: 0,iq,marks,pkg
0,100,80,10
1,90,70,7
2,120,100,14
3,80,5,2
4,100,80,10


Note that the `inplace` parameter can be set to `True` to apply the change directly to the original variable.
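
The two ways of keeping a rename can be sketched as follows. `stud_demo` is a made-up two-row frame used only for illustration; `errors="raise"` is a `rename()` option that turns an unknown column name into an error instead of ignoring it.

```python
import pandas as pd

stud_demo = pd.DataFrame({"iq": [100, 90], "marks": [80, 70], "pkg": [10, 7]})

# Option 1: keep the result by assigning it to a variable
renamed = stud_demo.rename(columns={"marks": "percent", "pkg": "lpa"})
print(list(renamed.columns))    # new names
print(list(stud_demo.columns))  # original names are untouched

# Option 2: modify the original frame directly
stud_demo.rename(columns={"marks": "percent"}, inplace=True)
print(list(stud_demo.columns))

# errors="raise" reports a misspelled column name as a KeyError
try:
    stud_demo.rename(columns={"package": "lpa"}, errors="raise")
except KeyError as exc:
    print("rename failed:", exc)
```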

## Mathematical Methods
In this section, we will discuss some of the mathematical functions available on DataFrames. Many of them mirror those of Series, since a DataFrame is a collection of multiple Series.

### Sum
Applies `sum()` over each column of the DataFrame. Note that if the **dtype** of a column is *string*, the strings will be **concatenated** rather than added; numeric columns (integer or float) are summed as usual.

In most cases, it is not meaningful to **sum** every column of a DataFrame. Additionally, `sum()` can be performed across the rows of the DataFrame by setting the `axis` parameter to `1`.

In [26]:
stud.sum()

iq       490
marks    335
pkg       43
dtype: int64

In [27]:
# for row wise sum, although this isn't used much
stud.sum(axis=1)

0    190
1    167
2    234
3     87
4    190
dtype: int64

Similarly, we have other functionalities such as `min()`, `max()`, `mean()`, `median()`, `std()`, and `var()`.
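
As a brief illustration of these methods, the sketch below reuses the values of `stud` in a toy frame named `stud_demo`. `agg()` with a list of names is one way to compute several statistics in a single call.

```python
import pandas as pd

stud_demo = pd.DataFrame({
    "iq": [100, 90, 120, 80, 100],
    "marks": [80, 70, 100, 5, 80],
    "pkg": [10, 7, 14, 2, 10],
})

col_means = stud_demo.mean()        # column-wise mean
row_max = stud_demo.max(axis=1)     # largest value in each row
summary = stud_demo.agg(["min", "max", "median"])  # several statistics at once

print(col_means)
print(row_max)
print(summary)
```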

## Selecting *specific* **Columns** from a DataFrame

Often we need only specific columns from a DataFrame rather than the whole DataFrame. This can be done in the following ways:

In [28]:
# to display the column 'Venue' of DF ipl
ipl['Venue']

0                Narendra Modi Stadium, Ahmedabad
1                Narendra Modi Stadium, Ahmedabad
2                           Eden Gardens, Kolkata
3                           Eden Gardens, Kolkata
4                        Wankhede Stadium, Mumbai
                          ...                    
945                                  Eden Gardens
946                              Wankhede Stadium
947                              Feroz Shah Kotla
948    Punjab Cricket Association Stadium, Mohali
949                         M Chinnaswamy Stadium
Name: Venue, Length: 950, dtype: object

In the above code, we select a single column, so the result is a **Series**; selecting multiple columns returns a **DataFrame**.

To extract multiple columns from a DataFrame, pass a *list* of column names inside the brackets (fancy indexing). The columns appear in the order in which they are listed.

In [29]:
movies[['title_x', 'year_of_release', 'actors']]

Unnamed: 0,title_x,year_of_release,actors
0,Uri: The Surgical Strike,2019,Vicky Kaushal|Paresh Rawal|Mohit Raina|Yami Ga...
1,Battalion 609,2019,Vicky Ahuja|Shoaib Ibrahim|Shrikant Kamat|Elen...
2,The Accidental Prime Minister (film),2019,Anupam Kher|Akshaye Khanna|Aahana Kumra|Atul S...
3,Why Cheat India,2019,Emraan Hashmi|Shreya Dhanwanthary|Snighdadeep ...
4,Evening Shadows,2018,Mona Ambegaonkar|Ananth Narayan Mahadevan|Deva...
...,...,...,...
1624,Tera Mera Saath Rahen,2001,Ajay Devgn|Sonali Bendre|Namrata Shirodkar|Pre...
1625,Yeh Zindagi Ka Safar,2001,Ameesha Patel|Jimmy Sheirgill|Nafisa Ali|Gulsh...
1626,Sabse Bada Sukh,2018,Vijay Arora|Asrani|Rajni Bala|Kumud Damle|Utpa...
1627,Daaka,2019,Gippy Grewal|Zareen Khan|


In [30]:
print(type(movies[['title_x', 'year_of_release', 'actors']]), type(ipl['Venue']))

<class 'pandas.core.frame.DataFrame'> <class 'pandas.core.series.Series'>
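
The difference between single and double brackets can be seen on a small made-up frame (`demo` and its values are illustrative only): a single label returns a Series, while a list returns a DataFrame, even when the list holds one name.

```python
import pandas as pd

demo = pd.DataFrame({
    "title": ["A", "B", "C"],
    "year": [2018, 2019, 2019],
    "rating": [7.3, 8.4, 6.1],
})

as_series = demo["title"]               # single label -> Series
as_frame = demo[["title"]]              # list with one label -> one-column DataFrame
reordered = demo[["rating", "title"]]   # columns come back in the order listed

print(type(as_series).__name__, type(as_frame).__name__)
print(list(reordered.columns))
```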


## Selecting *specific* **Rows** from a DataFrame
In a similar way, we can also select specific rows from a DataFrame.

For this, we must first learn about *two* indexers:
1. `iloc`: selects using **index positions**
2. `loc`: selects using **index labels**

Here is a basic overview of `iloc` and `loc`, after which we will move on to use cases on our own DataFrames.

## 📍 `iloc` vs `loc` in Pandas

`iloc` and `loc` are two of the most important **data selection methods** in pandas. They help in accessing rows and columns of a DataFrame in different ways:

---

### 🔹 `iloc` (Integer Location Based Indexing)

* Stands for **integer-location based indexing**
* Access rows/columns by **position**
* Uses **integer indexes**
* **Exclusive of the end index** in slicing (like Python lists)

#### 🔸 Syntax:

```python
df.iloc[<row_index>, <column_index>]
```

#### ✅ Examples:

```python
df.iloc[0]            # First row
df.iloc[0, 1]         # Value at 1st row and 2nd column
df.iloc[2:5]          # Rows from index 2 to 4
df.iloc[:, 0]         # All rows, 1st column
```

---

### 🔹 `loc` (Label Based Indexing)

* Stands for **label-based indexing**
* Access rows/columns by **labels/names**
* Uses **row/column labels**
* **Inclusive of the end label** in slicing

#### 🔸 Syntax:

```python
df.loc[<row_label>, <column_label>]
```

#### ✅ Examples:

```python
df.loc[0]                    # Row with index label 0
df.loc[0, 'Name']            # Value from row label 0 and column 'Name'
df.loc[1:3, ['Name', 'Score']]  # Multiple row and column labels
df.loc[:, 'Score']           # All rows, 'Score' column
```

---

### 🎯 Use Cases

| Use Case                                                | `iloc`                                                | `loc`                      |
| ------------------------------------------------------- | ----------------------------------------------------- | -------------------------- |
| Working with positions (e.g. nth row/column)            | ✅ Best suited                                         | ❌ Not applicable           |
| Accessing by label (e.g. 'Age', 'Name')                 | ❌ Not allowed                                         | ✅ Required                 |
| Iterating row-wise/column-wise numerically              | ✅                                                     | ❌                          |
| Filtering with logical conditions (on index or columns) | ⚠️ Accepts a Boolean array or list, not a Boolean Series | ✅ Fully supported          |
| Selecting a column by name                              | ❌ Use `iloc[:, index]`                                | ✅ Use `loc[:, 'col_name']` |
| Slicing including last element                          | ❌ Exclusive end                                       | ✅ Inclusive end            |

---

### ⚠️ Quick Comparison

| Feature         | `iloc`          | `loc`                       |
| --------------- | --------------- | --------------------------- |
| Index type      | Integer         | Label                       |
| Syntax          | `df.iloc[2, 1]` | `df.loc[2, 'Name']`         |
| Slice behavior  | Exclusive end   | Inclusive end               |
| Error if label? | Yes             | No                          |
| Error if int?   | No              | Yes (unless int is a label) |

---

### ✅ **Core Difference Between `iloc` and `loc`**

| Feature                    | `iloc`                                     | `loc`                                                     |
| -------------------------- | ------------------------------------------ | --------------------------------------------------------- |
| **Based on**               | **Integer positions** (0-based)            | **Labels** or names of rows/columns                       |
| **Indexing**               | Works like Python lists                    | Works like dictionary keys                                |
| **Use case**               | When you know the position                 | When you know the label                                   |
| **Slicing**                | End index is **exclusive** (`[start:end)`) | End label is **inclusive** (`[start:end]`)                |
| **Error on label**         | ❌ Cannot use string labels                 | ✅ Can use string labels                                   |
| **Error on integer index** | ✅ Accepts any in-range integer position    | ❌ Raises `KeyError` unless the integer is an existing label |

---

### 🎯 In One Line:

> **`iloc`** is **position-based**,
> **`loc`** is **label-based**.

---

### 🔍 Example:

```python
import pandas as pd

df = pd.DataFrame({
    "Name": ["Alice", "Bob", "Charlie"],
    "Age": [25, 30, 35]
}, index=["a", "b", "c"])
```

#### Using `iloc`:

```python
df.iloc[0]        # First row (index 'a')
df.iloc[1, 1]     # Value 30 (row 2, column 2)
```

#### Using `loc`:

```python
df.loc["a"]       # Row with label 'a'
df.loc["b", "Age"] # Value 30 (label 'b', column 'Age')
```
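
The distinction matters most when the index holds integers that are not the positions 0, 1, 2. The sketch below assumes a hypothetical frame `df_demo` with labels 10, 20 and 30 to show the two indexers side by side.

```python
import pandas as pd

# Hypothetical frame whose integer labels differ from the positions
df_demo = pd.DataFrame(
    {"Name": ["Alice", "Bob", "Charlie"], "Age": [25, 30, 35]},
    index=[10, 20, 30],
)

first_by_position = df_demo.iloc[0]   # position 0 -> the row labelled 10
first_by_label = df_demo.loc[10]      # label 10 -> the same row

by_position = df_demo.iloc[0:2]       # positions 0 and 1 (end excluded)
by_label = df_demo.loc[10:20]         # labels 10 through 20 (end included)

print(by_position.equals(by_label))

# Label 0 does not exist, so loc raises a KeyError
try:
    df_demo.loc[0]
except KeyError:
    print("label 0 not found")
```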

### `iloc`

In [8]:
import pandas as pd

df = pd.DataFrame({
    "Name": ["Alice", "Bob", "Charlie"],
    "Age": [25, 30, 35]
}, index=["a", "b", "c"])

df.iloc[0]

Name    Alice
Age        25
Name: a, dtype: object

In [31]:
# single row
print(movies.iloc[2])
type(movies.iloc[2])

title_x                          The Accidental Prime Minister (film)
imdb_id                                                     tt6986710
poster_path         https://upload.wikimedia.org/wikipedia/en/thum...
wiki_link           https://en.wikipedia.org/wiki/The_Accidental_P...
title_y                                 The Accidental Prime Minister
original_title                          The Accidental Prime Minister
is_adult                                                            0
year_of_release                                                  2019
runtime                                                           112
genres                                                Biography|Drama
imdb_rating                                                       6.1
imdb_votes                                                       5549
story               Based on the memoir by Indian policy analyst S...
summary             Explores Manmohan Singh's tenure as the Prime ...
tagline             

pandas.core.series.Series

In [32]:
# multiple rows
print(type(movies.iloc[0:5]))
movies.iloc[0:5]

<class 'pandas.core.frame.DataFrame'>


Unnamed: 0,title_x,imdb_id,poster_path,wiki_link,title_y,original_title,is_adult,year_of_release,runtime,genres,imdb_rating,imdb_votes,story,summary,tagline,actors,wins_nominations,release_date
0,Uri: The Surgical Strike,tt8291224,https://upload.wikimedia.org/wikipedia/en/thum...,https://en.wikipedia.org/wiki/Uri:_The_Surgica...,Uri: The Surgical Strike,Uri: The Surgical Strike,0,2019,138,Action|Drama|War,8.4,35112,Divided over five chapters the film chronicle...,Indian army special forces execute a covert op...,,Vicky Kaushal|Paresh Rawal|Mohit Raina|Yami Ga...,4 wins,11 January 2019 (USA)
1,Battalion 609,tt9472208,,https://en.wikipedia.org/wiki/Battalion_609,Battalion 609,Battalion 609,0,2019,131,War,4.1,73,The story revolves around a cricket match betw...,The story of Battalion 609 revolves around a c...,,Vicky Ahuja|Shoaib Ibrahim|Shrikant Kamat|Elen...,,11 January 2019 (India)
2,The Accidental Prime Minister (film),tt6986710,https://upload.wikimedia.org/wikipedia/en/thum...,https://en.wikipedia.org/wiki/The_Accidental_P...,The Accidental Prime Minister,The Accidental Prime Minister,0,2019,112,Biography|Drama,6.1,5549,Based on the memoir by Indian policy analyst S...,Explores Manmohan Singh's tenure as the Prime ...,,Anupam Kher|Akshaye Khanna|Aahana Kumra|Atul S...,,11 January 2019 (USA)
3,Why Cheat India,tt8108208,https://upload.wikimedia.org/wikipedia/en/thum...,https://en.wikipedia.org/wiki/Why_Cheat_India,Why Cheat India,Why Cheat India,0,2019,121,Crime|Drama,6.0,1891,The movie focuses on existing malpractices in ...,The movie focuses on existing malpractices in ...,,Emraan Hashmi|Shreya Dhanwanthary|Snighdadeep ...,,18 January 2019 (USA)
4,Evening Shadows,tt6028796,,https://en.wikipedia.org/wiki/Evening_Shadows,Evening Shadows,Evening Shadows,0,2018,102,Drama,7.3,280,While gay rights and marriage equality has bee...,Under the 'Evening Shadows' truth often plays...,,Mona Ambegaonkar|Ananth Narayan Mahadevan|Deva...,17 wins & 1 nomination,11 January 2019 (India)


Fancy indexing, that is, passing a list of positions, can also be used to fetch any number of rows, including non-contiguous ones.

In [33]:
movies.iloc[[0, 4, 5]]

Unnamed: 0,title_x,imdb_id,poster_path,wiki_link,title_y,original_title,is_adult,year_of_release,runtime,genres,imdb_rating,imdb_votes,story,summary,tagline,actors,wins_nominations,release_date
0,Uri: The Surgical Strike,tt8291224,https://upload.wikimedia.org/wikipedia/en/thum...,https://en.wikipedia.org/wiki/Uri:_The_Surgica...,Uri: The Surgical Strike,Uri: The Surgical Strike,0,2019,138,Action|Drama|War,8.4,35112,Divided over five chapters the film chronicle...,Indian army special forces execute a covert op...,,Vicky Kaushal|Paresh Rawal|Mohit Raina|Yami Ga...,4 wins,11 January 2019 (USA)
4,Evening Shadows,tt6028796,,https://en.wikipedia.org/wiki/Evening_Shadows,Evening Shadows,Evening Shadows,0,2018,102,Drama,7.3,280,While gay rights and marriage equality has bee...,Under the 'Evening Shadows' truth often plays...,,Mona Ambegaonkar|Ananth Narayan Mahadevan|Deva...,17 wins & 1 nomination,11 January 2019 (India)
5,Soni (film),tt6078866,https://upload.wikimedia.org/wikipedia/en/thum...,https://en.wikipedia.org/wiki/Soni_(film),Soni,Soni,0,2018,97,Drama,7.2,1595,Soni a young policewoman in Delhi and her su...,While fighting crimes against women in Delhi ...,,Geetika Vidya Ohlyan|Saloni Batra|Vikas Shukla...,3 wins & 5 nominations,18 January 2019 (USA)


### `loc`

In [34]:
students.set_index('name', inplace=True)

In [35]:
students.loc['nool']

iq       100
marks     80
pkg       10
Name: nool, dtype: int64

In [36]:
# fancy indexing using loc
students.loc[['nool', 'rupaid', 'amnit']]

Unnamed: 0_level_0,iq,marks,pkg
name,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1
nool,100,80,10
rupaid,120,100,14
amnit,0,0,0


## Selecting both Rows and Columns

Suppose we want to fetch the data of first 3 movies, but only require the information from the first three columns:

In [37]:
movies.iloc[0:3, 0:3]

Unnamed: 0,title_x,imdb_id,poster_path
0,Uri: The Surgical Strike,tt8291224,https://upload.wikimedia.org/wikipedia/en/thum...
1,Battalion 609,tt9472208,
2,The Accidental Prime Minister (film),tt6986710,https://upload.wikimedia.org/wikipedia/en/thum...


In [38]:
# similar selection using loc; the end label 3 is included, so four rows come back
movies.loc[0:3, 'title_x': 'poster_path']

Unnamed: 0,title_x,imdb_id,poster_path
0,Uri: The Surgical Strike,tt8291224,https://upload.wikimedia.org/wikipedia/en/thum...
1,Battalion 609,tt9472208,
2,The Accidental Prime Minister (film),tt6986710,https://upload.wikimedia.org/wikipedia/en/thum...
3,Why Cheat India,tt8108208,https://upload.wikimedia.org/wikipedia/en/thum...
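
To get exactly the same selection from both indexers, the `loc` slice has to stop one label earlier. A minimal sketch on a made-up frame (`demo`, its column names and values are assumptions for illustration):

```python
import pandas as pd

demo = pd.DataFrame({
    "title": ["A", "B", "C", "D", "E"],
    "imdb_id": ["tt1", "tt2", "tt3", "tt4", "tt5"],
    "year": [2019, 2019, 2018, 2018, 2017],
    "rating": [8.4, 4.1, 6.1, 6.0, 7.3],
})

by_position = demo.iloc[0:3, 0:3]           # rows 0-2, first three columns
by_label = demo.loc[0:2, "title":"year"]    # end label 2 and column 'year' are included

print(by_position.equals(by_label))
print(demo.loc[0:3, "title":"year"].shape)  # one extra row compared with iloc[0:3]
```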


## Filtering a DataFrame
Here, we will select specific rows and columns based on conditions.

In [39]:
ipl.head(2)

Unnamed: 0,ID,City,Date,Season,MatchNumber,Team1,Team2,Venue,TossWinner,TossDecision,SuperOver,WinningTeam,WonBy,Margin,method,Player_of_Match,Team1Players,Team2Players,Umpire1,Umpire2
0,1312200,Ahmedabad,2022-05-29,2022,Final,Rajasthan Royals,Gujarat Titans,"Narendra Modi Stadium, Ahmedabad",Rajasthan Royals,bat,N,Gujarat Titans,Wickets,7.0,,HH Pandya,"['YBK Jaiswal', 'JC Buttler', 'SV Samson', 'D ...","['WP Saha', 'Shubman Gill', 'MS Wade', 'HH Pan...",CB Gaffaney,Nitin Menon
1,1312199,Ahmedabad,2022-05-27,2022,Qualifier 2,Royal Challengers Bangalore,Rajasthan Royals,"Narendra Modi Stadium, Ahmedabad",Rajasthan Royals,field,N,Rajasthan Royals,Wickets,7.0,,JC Buttler,"['V Kohli', 'F du Plessis', 'RM Patidar', 'GJ ...","['YBK Jaiswal', 'JC Buttler', 'SV Samson', 'D ...",CB Gaffaney,Nitin Menon


### Task: Find all the final winners

In [40]:
# find all the final winners
mask = ipl['MatchNumber'] == 'Final'
finalMatch_df = ipl[mask]
finalMatch_df[['Season', 'Team1', 'Team2', 'WinningTeam']]

Unnamed: 0,Season,Team1,Team2,WinningTeam
0,2022,Rajasthan Royals,Gujarat Titans,Gujarat Titans
74,2021,Chennai Super Kings,Kolkata Knight Riders,Chennai Super Kings
134,2020/21,Delhi Capitals,Mumbai Indians,Mumbai Indians
194,2019,Mumbai Indians,Chennai Super Kings,Mumbai Indians
254,2018,Sunrisers Hyderabad,Chennai Super Kings,Chennai Super Kings
314,2017,Mumbai Indians,Rising Pune Supergiant,Mumbai Indians
373,2016,Royal Challengers Bangalore,Sunrisers Hyderabad,Sunrisers Hyderabad
433,2015,Mumbai Indians,Chennai Super Kings,Mumbai Indians
492,2014,Kolkata Knight Riders,Kings XI Punjab,Kolkata Knight Riders
552,2013,Chennai Super Kings,Mumbai Indians,Mumbai Indians


Alternatively, the above code can be condensed into a single expression:

In [41]:
ipl[ipl['MatchNumber'] == 'Final'][['Season', 'Team1', 'Team2', 'WinningTeam']]

Unnamed: 0,Season,Team1,Team2,WinningTeam
0,2022,Rajasthan Royals,Gujarat Titans,Gujarat Titans
74,2021,Chennai Super Kings,Kolkata Knight Riders,Chennai Super Kings
134,2020/21,Delhi Capitals,Mumbai Indians,Mumbai Indians
194,2019,Mumbai Indians,Chennai Super Kings,Mumbai Indians
254,2018,Sunrisers Hyderabad,Chennai Super Kings,Chennai Super Kings
314,2017,Mumbai Indians,Rising Pune Supergiant,Mumbai Indians
373,2016,Royal Challengers Bangalore,Sunrisers Hyderabad,Sunrisers Hyderabad
433,2015,Mumbai Indians,Chennai Super Kings,Mumbai Indians
492,2014,Kolkata Knight Riders,Kings XI Punjab,Kolkata Knight Riders
552,2013,Chennai Super Kings,Mumbai Indians,Mumbai Indians


### Task: How many super over finishes have occurred?

In [42]:
# strategy 1
(ipl['SuperOver'] == 'Y').sum()

np.int64(14)

In [43]:
# strategy 2
ipl[ipl['SuperOver'] == 'Y'].shape[0]

14

### Task: How many matches have "Chennai Super Kings" won in **Kolkata**?

In [58]:
ipl[(ipl['City'] == 'Kolkata') & (ipl['WinningTeam'] == 'Chennai Super Kings')].shape[0]

5
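
Conditions are combined with `&` (and), `|` (or) and `~` (not); Python's `and`/`or` do not work on a Series, and each comparison written inline needs its own parentheses because `&` binds more tightly than `==`. The sketch below uses a made-up `matches` frame that only borrows the column names of `ipl`.

```python
import pandas as pd

# Hypothetical match data with the same column names as `ipl`
matches = pd.DataFrame({
    "City": ["Kolkata", "Kolkata", "Mumbai", "Chennai"],
    "WinningTeam": ["Chennai Super Kings", "Kolkata Knight Riders",
                    "Chennai Super Kings", "Chennai Super Kings"],
})

in_kolkata = matches["City"] == "Kolkata"
csk_won = matches["WinningTeam"] == "Chennai Super Kings"

both = matches[in_kolkata & csk_won]     # AND: both conditions hold
either = matches[in_kolkata | csk_won]   # OR: at least one holds
not_kolkata = matches[~in_kolkata]       # NOT: invert a mask

print(len(both), len(either), len(not_kolkata))
```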

### Task: The percentage of **Toss Winner** also being the **Match Winner**

In [70]:
times_succeeded = ipl[ipl['TossWinner'] == ipl['WinningTeam']].shape[0]
total_matches = ipl.shape[0]
percentage = (times_succeeded / total_matches) * 100
print(f"The percentage of Toss Winner also being the Winning Team is {round(percentage, 2)}%")
print(times_succeeded, total_matches)

The percentage of Toss Winner also being the Winning Team is 51.47%
489 950
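
Because `True` counts as 1, the same percentage can also be obtained by taking the mean of the Boolean mask. A minimal sketch on made-up data (the `matches` frame and team names `A` to `D` are placeholders):

```python
import pandas as pd

# Hypothetical results with the same column names as `ipl`
matches = pd.DataFrame({
    "TossWinner": ["A", "B", "A", "C"],
    "WinningTeam": ["A", "A", "A", "D"],
})

same_team = matches["TossWinner"] == matches["WinningTeam"]
percentage = same_team.mean() * 100   # fraction of True values, as a percent

print(f"{percentage:.2f}%")
```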


### Task: Movies with rating higher than 8 and votes > 10000

In [80]:
movies.head(2)

Unnamed: 0,title_x,imdb_id,poster_path,wiki_link,title_y,original_title,is_adult,year_of_release,runtime,genres,imdb_rating,imdb_votes,story,summary,tagline,actors,wins_nominations,release_date
0,Uri: The Surgical Strike,tt8291224,https://upload.wikimedia.org/wikipedia/en/thum...,https://en.wikipedia.org/wiki/Uri:_The_Surgica...,Uri: The Surgical Strike,Uri: The Surgical Strike,0,2019,138,Action|Drama|War,8.4,35112,Divided over five chapters the film chronicle...,Indian army special forces execute a covert op...,,Vicky Kaushal|Paresh Rawal|Mohit Raina|Yami Ga...,4 wins,11 January 2019 (USA)
1,Battalion 609,tt9472208,,https://en.wikipedia.org/wiki/Battalion_609,Battalion 609,Battalion 609,0,2019,131,War,4.1,73,The story revolves around a cricket match betw...,The story of Battalion 609 revolves around a c...,,Vicky Ahuja|Shoaib Ibrahim|Shrikant Kamat|Elen...,,11 January 2019 (India)


In [83]:
movies[(movies['imdb_rating'] > 8) & (movies['imdb_votes'] > 10000)].shape[0]

43

### Task: Action movies (but not necessarily only Action) with rating higher than 7.5

This task cannot be solved with a direct comparison. Since a movie can have multiple **genres**, stored as a single `|`-separated string, we first split that string into a list of genres.

In [88]:
mask1 = movies['genres'].str.split('|').apply(lambda x: 'Action' in x)
mask2 = movies['imdb_rating'] > 7.5

Here is a brief explanation of the above code:
- The `.str` accessor exposes string methods on the `genres` column; it does not convert the column.
- `split('|')` then turns each genre string into a list of genres.
- We then **apply** our single-line logic, via a `lambda` function, using the *membership operator* `in` to check whether 'Action' is one of the list elements.
- These filters are stored as the variables `mask1` and `mask2`, which are combined with `&` to pick the matching **rows** of the DataFrame **movies** in the following code cell.

In [91]:
movies[(mask1) & (mask2)].shape[0]

33

In [94]:
# or we could also create the mask using this strategy
mask1 = movies['genres'].str.contains('Action')
mask2 = movies['imdb_rating'] > 7.5
movies[(mask1) & (mask2)].shape[0]

33
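
The two strategies agree here, but they are not equivalent in general: `str.contains` matches substrings, while the split-and-membership test matches whole genre names. The sketch below uses made-up genre strings ("Action-Comedy" is an invented label) and also shows `na=False`, which treats missing values as non-matches.

```python
import pandas as pd

# Hypothetical genre strings; "Action-Comedy" is an invented label
genres = pd.Series(["Action|Drama", "Action-Comedy|Romance", "Drama", None])

# Substring match: also matches "Action-Comedy"; na=False treats missing as no match
by_contains = genres.str.contains("Action", regex=False, na=False)

# Exact genre match: split into a list, then test membership
by_membership = genres.str.split("|").apply(
    lambda parts: isinstance(parts, list) and "Action" in parts
)

print(by_contains.tolist())
print(by_membership.tolist())
```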

### Task: Write a function that receives the names of 2 teams from the user and returns their track record.

In [95]:
ipl.head(2)

Unnamed: 0,ID,City,Date,Season,MatchNumber,Team1,Team2,Venue,TossWinner,TossDecision,SuperOver,WinningTeam,WonBy,Margin,method,Player_of_Match,Team1Players,Team2Players,Umpire1,Umpire2
0,1312200,Ahmedabad,2022-05-29,2022,Final,Rajasthan Royals,Gujarat Titans,"Narendra Modi Stadium, Ahmedabad",Rajasthan Royals,bat,N,Gujarat Titans,Wickets,7.0,,HH Pandya,"['YBK Jaiswal', 'JC Buttler', 'SV Samson', 'D ...","['WP Saha', 'Shubman Gill', 'MS Wade', 'HH Pan...",CB Gaffaney,Nitin Menon
1,1312199,Ahmedabad,2022-05-27,2022,Qualifier 2,Royal Challengers Bangalore,Rajasthan Royals,"Narendra Modi Stadium, Ahmedabad",Rajasthan Royals,field,N,Rajasthan Royals,Wickets,7.0,,JC Buttler,"['V Kohli', 'F du Plessis', 'RM Patidar', 'GJ ...","['YBK Jaiswal', 'JC Buttler', 'SV Samson', 'D ...",CB Gaffaney,Nitin Menon


In [111]:
teams_list = ipl.apply(lambda row: [row['Team1'], row['Team2']], axis=1)
teams_list

0                     [Rajasthan Royals, Gujarat Titans]
1        [Royal Challengers Bangalore, Rajasthan Royals]
2      [Royal Challengers Bangalore, Lucknow Super Gi...
3                     [Rajasthan Royals, Gujarat Titans]
4                    [Sunrisers Hyderabad, Punjab Kings]
                             ...                        
945             [Kolkata Knight Riders, Deccan Chargers]
946        [Mumbai Indians, Royal Challengers Bangalore]
947                 [Delhi Daredevils, Rajasthan Royals]
948               [Kings XI Punjab, Chennai Super Kings]
949    [Royal Challengers Bangalore, Kolkata Knight R...
Length: 950, dtype: object

In [110]:
teams_list = ipl.apply(lambda row: row["Team1"] + ' ' + row['Team2'], axis=1)
teams_list

0                        Rajasthan Royals Gujarat Titans
1           Royal Challengers Bangalore Rajasthan Royals
2       Royal Challengers Bangalore Lucknow Super Giants
3                        Rajasthan Royals Gujarat Titans
4                       Sunrisers Hyderabad Punjab Kings
                             ...                        
945                Kolkata Knight Riders Deccan Chargers
946           Mumbai Indians Royal Challengers Bangalore
947                    Delhi Daredevils Rajasthan Royals
948                  Kings XI Punjab Chennai Super Kings
949    Royal Challengers Bangalore Kolkata Knight Riders
Length: 950, dtype: object

In [None]:
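def trackRecord(team1, team2, ipl_df):
    # Assumption: "track record" means wins per team in their head-to-head matches
    # True for rows where the two teams met, in either Team1/Team2 order
    met = ipl_df.apply(lambda row: {row['Team1'], row['Team2']} == {team1, team2}, axis=1)
    # matches without a winner (NaN in WinningTeam) are left out by value_counts()
    return ipl_df[met]['WinningTeam'].value_counts()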

## Adding New Columns
We can add new columns to our DataFrame, either by **creating** a completely new one or by deriving one from an **existing Series**.

In [112]:
movies.head(2)

Unnamed: 0,title_x,imdb_id,poster_path,wiki_link,title_y,original_title,is_adult,year_of_release,runtime,genres,imdb_rating,imdb_votes,story,summary,tagline,actors,wins_nominations,release_date
0,Uri: The Surgical Strike,tt8291224,https://upload.wikimedia.org/wikipedia/en/thum...,https://en.wikipedia.org/wiki/Uri:_The_Surgica...,Uri: The Surgical Strike,Uri: The Surgical Strike,0,2019,138,Action|Drama|War,8.4,35112,Divided over five chapters the film chronicle...,Indian army special forces execute a covert op...,,Vicky Kaushal|Paresh Rawal|Mohit Raina|Yami Ga...,4 wins,11 January 2019 (USA)
1,Battalion 609,tt9472208,,https://en.wikipedia.org/wiki/Battalion_609,Battalion 609,Battalion 609,0,2019,131,War,4.1,73,The story revolves around a cricket match betw...,The story of Battalion 609 revolves around a c...,,Vicky Ahuja|Shoaib Ibrahim|Shrikant Kamat|Elen...,,11 January 2019 (India)


In [None]:
# Creating a new column, containing the same value for each row:
movies['Country'] = 'India'

In [7]:
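# creating one from an existing one
# .copy() gives an independent DataFrame, which can avoid a SettingWithCopyWarning in some pandas versions
notNullMovies = movies.dropna().copy()
notNullMovies['lead actor'] = notNullMovies['actors'].str.split('|').apply(lambda x: x[0])
notNullMovies.head()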

Unnamed: 0,title_x,imdb_id,poster_path,wiki_link,title_y,original_title,is_adult,year_of_release,runtime,genres,imdb_rating,imdb_votes,story,summary,tagline,actors,wins_nominations,release_date,lead actor
11,Gully Boy,tt2395469,https://upload.wikimedia.org/wikipedia/en/thum...,https://en.wikipedia.org/wiki/Gully_Boy,Gully Boy,Gully Boy,0,2019,153,Drama|Music,8.2,22440,"Gully Boy is a film about a 22-year-old boy ""M...",A coming-of-age story based on the lives of st...,Apna Time Aayega!,Ranveer Singh|Alia Bhatt|Siddhant Chaturvedi|V...,6 wins & 3 nominations,14 February 2019 (USA),Ranveer Singh
34,Yeh Hai India,tt5525846,https://upload.wikimedia.org/wikipedia/en/thum...,https://en.wikipedia.org/wiki/Yeh_Hai_India,Yeh Hai India,Yeh Hai India,0,2017,128,Action|Adventure|Drama,5.7,169,Yeh Hai India follows the story of a 25 years...,Yeh Hai India follows the story of a 25 years...,A Film for Every Indian,Gavie Chahal|Mohan Agashe|Mohan Joshi|Lom Harsh|,2 wins & 1 nomination,24 May 2019 (India),Gavie Chahal
37,Article 15 (film),tt10324144,https://upload.wikimedia.org/wikipedia/en/thum...,https://en.wikipedia.org/wiki/Article_15_(film),Article 15,Article 15,0,2019,130,Crime|Drama,8.3,13417,In the rural heartlands of India an upright p...,In the rural heartlands of India an upright p...,Farq Bahut Kar Liya| Ab Farq Laayenge.,Ayushmann Khurrana|Nassar|Manoj Pahwa|Kumud Mi...,1 win,28 June 2019 (USA),Ayushmann Khurrana
87,Aiyaary,tt6774212,https://upload.wikimedia.org/wikipedia/en/thum...,https://en.wikipedia.org/wiki/Aiyaary,Aiyaary,Aiyaary,0,2018,157,Action|Thriller,5.2,3538,General Gurinder Singh comes with a proposal t...,After finding out about an illegal arms deal ...,The Ultimate Trickery,Sidharth Malhotra|Manoj Bajpayee|Rakul Preet S...,1 nomination,16 February 2018 (USA),Sidharth Malhotra
96,Raid (2018 film),tt7363076,https://upload.wikimedia.org/wikipedia/en/thum...,https://en.wikipedia.org/wiki/Raid_(2018_film),Raid,Raid,0,2018,122,Action|Crime|Drama,7.4,13159,Set in the 80s in Uttar Pradesh India Raid i...,A fearless income tax officer raids the mansio...,Heroes don't always come in uniform,Ajay Devgn|Saurabh Shukla|Ileana D'Cruz|Amit S...,2 wins & 3 nominations,16 March 2018 (India),Ajay Devgn
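
Both ways of adding a column can be sketched on a small frame. The `demo` frame below is hypothetical and only borrows the `|`-separated format of the `actors` column; `.str[0]` is an alternative to `apply(lambda x: x[0])` for taking the first element of each list.

```python
import pandas as pd

# Hypothetical data using the same '|'-separated format as the `actors` column
demo = pd.DataFrame({
    "title": ["A", "B"],
    "actors": ["Vicky Kaushal|Paresh Rawal", "Gippy Grewal|Zareen Khan|"],
})

demo["Country"] = "India"                                   # same value for every row
demo["lead actor"] = demo["actors"].str.split("|").str[0]   # first element of each list

print(demo[["title", "Country", "lead actor"]])
```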


## Important DataFrame Functions

In [122]:
ipl.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 950 entries, 0 to 949
Data columns (total 20 columns):
 #   Column           Non-Null Count  Dtype  
---  ------           --------------  -----  
 0   ID               950 non-null    int64  
 1   City             899 non-null    object 
 2   Date             950 non-null    object 
 3   Season           950 non-null    object 
 4   MatchNumber      950 non-null    object 
 5   Team1            950 non-null    object 
 6   Team2            950 non-null    object 
 7   Venue            950 non-null    object 
 8   TossWinner       950 non-null    object 
 9   TossDecision     950 non-null    object 
 10  SuperOver        946 non-null    object 
 11  WinningTeam      946 non-null    object 
 12  WonBy            950 non-null    object 
 13  Margin           932 non-null    float64
 14  method           19 non-null     object 
 15  Player_of_Match  946 non-null    object 
 16  Team1Players     950 non-null    object 
 17  Team2Players    

### Astype
`astype()` converts a column to a different dtype. A common use is reducing the memory footprint of a DataFrame, for example by storing integers as `int32` instead of `int64`.

In [None]:
# astype
ipl['ID'] = ipl['ID'].astype('int32')
ipl.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 950 entries, 0 to 949
Data columns (total 20 columns):
 #   Column           Non-Null Count  Dtype  
---  ------           --------------  -----  
 0   ID               950 non-null    int32  
 1   City             899 non-null    object 
 2   Date             950 non-null    object 
 3   Season           950 non-null    object 
 4   MatchNumber      950 non-null    object 
 5   Team1            950 non-null    object 
 6   Team2            950 non-null    object 
 7   Venue            950 non-null    object 
 8   TossWinner       950 non-null    object 
 9   TossDecision     950 non-null    object 
 10  SuperOver        946 non-null    object 
 11  WinningTeam      946 non-null    object 
 12  WonBy            950 non-null    object 
 13  Margin           932 non-null    float64
 14  method           19 non-null     object 
 15  Player_of_Match  946 non-null    object 
 16  Team1Players     950 non-null    object 
 17  Team2Players    

Converting `ID` from `int64` to `int32` reduced the DataFrame's size by approximately 3.7 KB (950 values, 4 bytes saved on each). We can save more by converting columns that hold a small set of repeated values to the **category** dtype.
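
Savings like these can be measured directly with `memory_usage(deep=True)`. The sketch below is built on a hypothetical 950-row frame named `demo`, not on the real `ipl` data, so the exact byte counts for the string column will differ from the notebook's.

```python
import pandas as pd

# Hypothetical data: 950 rows, two repeated team names
n_rows = 950
demo = pd.DataFrame({
    "ID": range(n_rows),
    "Team1": ["Mumbai Indians", "Chennai Super Kings"] * (n_rows // 2),
})
demo["ID"] = demo["ID"].astype("int64")

before = demo.memory_usage(deep=True, index=False)

demo["ID"] = demo["ID"].astype("int32")            # 8 bytes -> 4 bytes per value
demo["Team1"] = demo["Team1"].astype("category")   # strings -> small integer codes

after = demo.memory_usage(deep=True, index=False)

print(before - after)   # bytes saved per column
```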

In [126]:
ipl['ID'] = ipl['ID'].astype('int32')
ipl['Season'] = ipl['Season'].astype('category')
ipl['Team1'] = ipl['Team1'].astype('category')
ipl['Team2'] = ipl['Team2'].astype('category')
ipl.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 950 entries, 0 to 949
Data columns (total 20 columns):
 #   Column           Non-Null Count  Dtype   
---  ------           --------------  -----   
 0   ID               950 non-null    int32   
 1   City             899 non-null    object  
 2   Date             950 non-null    object  
 3   Season           950 non-null    category
 4   MatchNumber      950 non-null    object  
 5   Team1            950 non-null    category
 6   Team2            950 non-null    category
 7   Venue            950 non-null    object  
 8   TossWinner       950 non-null    object  
 9   TossDecision     950 non-null    object  
 10  SuperOver        946 non-null    object  
 11  WinningTeam      946 non-null    object  
 12  WonBy            950 non-null    object  
 13  Margin           932 non-null    float64 
 14  method           19 non-null     object  
 15  Player_of_Match  946 non-null    object  
 16  Team1Players     950 non-null    object  
 1