
# 35 | Exercises - Pandas DataFrame

A **DataFrame** is a 2D labeled data structure in Python, provided by the `Pandas` library.

## What is a DataFrame?

- Each column can hold different types of data (numbers, text, dates)
- Every row and column has a label
- You can perform complex operations with simple commands

## Why use DataFrames?

**DataFrames** are like spreadsheets in Python. They are easy to read, flexible with different data types, and powerful for analysis. You can filter, calculate, and summarize data with just a few lines of code.

## Importing `Pandas`

**DataFrames** are part of the `Pandas` library, which must be imported before use.

In [1]:
import pandas as pd

## Creating a DataFrame

**DataFrames** can be created in multiple ways. Two common approaches are using a **list** and a **dictionary**.

### DataFrame from a List

When creating a **DataFrame** from a simple list, the list is interpreted as a column of data. It is passed as the first argument to `pd.DataFrame()`, and a column name can optionally be specified using the `columns` parameter.

In [9]:
# creating a list with destinations
ls_destinations = ["Paris", "Tokyo", "New York"]
print(f"The list:\n{ls_destinations}\n")

# converting a list to a DataFrame
df_destinations = pd.DataFrame(ls_destinations)
print(f"The DataFrame:\n{df_destinations}\n")

# using the "columns" parameter
df_column_name = pd.DataFrame(ls_destinations, columns=["Destination"])
print(f"The DataFrame with a column name specified:\n{df_column_name}")

The list:
['Paris', 'Tokyo', 'New York']

The DataFrame:
          0
0     Paris
1     Tokyo
2  New York

The DataFrame with a column name specified:
  Destination
0       Paris
1       Tokyo
2    New York
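
A list can also hold several values per row. The sketch below assumes a hypothetical list of lists, where each inner list is one row and `columns` names both columns; the variable names `rows` and `df_rows` are illustrative only.

```python
import pandas as pd

# hypothetical data: each inner list is one row of [destination, country]
rows = [
    ["Paris", "France"],
    ["Tokyo", "Japan"],
    ["New York", "USA"],
]

# each inner list becomes a row; "columns" supplies one name per column
df_rows = pd.DataFrame(rows, columns=["Destination", "Country"])
print(df_rows)
print(df_rows.shape)
```

Without `columns`, the two columns would be labeled `0` and `1`, just like the single-column example above.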


### DataFrame from a Dictionary

When creating a **DataFrame** from a dictionary, the keys of the dictionary represent the column names, and the values represent the data for those columns.

In [10]:
# creating a dictionary with travel details
dict_travel_info = {
    "Destination": ["Paris", "Tokyo", "New York"],
    "Country": ["France", "Japan", "USA"],
    "Rating": [4.8, 4.7, 4.5]
}
print(f"The dictionary:\n{dict_travel_info}\n")

# converting a dictionary to a DataFrame
df_travel_info = pd.DataFrame(dict_travel_info)
print(f"The DataFrame:\n{df_travel_info}")

The dictionary:
{'Destination': ['Paris', 'Tokyo', 'New York'], 'Country': ['France', 'Japan', 'USA'], 'Rating': [4.8, 4.7, 4.5]}

The DataFrame:
  Destination Country  Rating
0       Paris  France     4.8
1       Tokyo   Japan     4.7
2    New York     USA     4.5
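
For comparison, the same table can be built row by row from a list of dictionaries, where each dictionary is one row and its keys are the column names. This is a minimal sketch; `records`, `df_records`, and `df_from_dict` are illustrative names.

```python
import pandas as pd

# column-oriented: one key per column, one list of values per column
df_from_dict = pd.DataFrame({
    "Destination": ["Paris", "Tokyo", "New York"],
    "Country": ["France", "Japan", "USA"],
    "Rating": [4.8, 4.7, 4.5],
})

# row-oriented: one dictionary per row, keys are the column names
records = [
    {"Destination": "Paris", "Country": "France", "Rating": 4.8},
    {"Destination": "Tokyo", "Country": "Japan", "Rating": 4.7},
    {"Destination": "New York", "Country": "USA", "Rating": 4.5},
]
df_records = pd.DataFrame(records)

# both approaches should produce the same DataFrame
print(df_records.equals(df_from_dict))
```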


## Inspecting a DataFrame's Structure

Pandas provides several attributes to quickly inspect the characteristics of a **DataFrame**, such as:

| Attribute | Description |
| --- | --- |
| `dtypes` | Displays the data type of each column |
| `columns` | Lists the column names of the DataFrame |
| `shape` | Returns the number of rows and columns as a tuple |

In [11]:
# view column data types
print(f"Column data types:\n{df_travel_info.dtypes}\n")

# view column names
print(f"Column names:\n{df_travel_info.columns}\n")

# view number of rows and columns
print(f"Number of rows and columns:\n{df_travel_info.shape}")

Column data types:
Destination     object
Country         object
Rating         float64
dtype: object

Column names:
Index(['Destination', 'Country', 'Rating'], dtype='object')

Number of rows and columns:
(3, 3)
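
These attributes return ordinary Python-friendly objects, so their values can be reused in code rather than only printed. The following sketch rebuilds the small travel DataFrame so that it runs on its own; the names `n_rows`, `n_cols`, `column_names`, and `rating_dtype` are illustrative.

```python
import pandas as pd

df_travel_info = pd.DataFrame({
    "Destination": ["Paris", "Tokyo", "New York"],
    "Country": ["France", "Japan", "USA"],
    "Rating": [4.8, 4.7, 4.5],
})

# shape is a (rows, columns) tuple, so it can be unpacked
n_rows, n_cols = df_travel_info.shape
print(f"{n_rows} rows and {n_cols} columns")

# columns is an Index object; list() turns it into a plain list
column_names = list(df_travel_info.columns)
print(column_names)

# dtypes is a Series indexed by column name
rating_dtype = df_travel_info.dtypes["Rating"]
print(rating_dtype)
```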


## Working with External Datasets

Pandas makes it easy to load, explore, and manipulate data, whether it is stored in a CSV or Excel file or comes preloaded with libraries like `Seaborn` or `Scikit-learn`.

### Creating DataFrames from CSV Files

Use the `pd.read_csv()` function to read CSV (Comma-Separated Values) files into a **DataFrame**.

```python
# basic syntax
pd.read_csv(filepath, sep=',', header=0)
```

- `filepath`: Path to the CSV file.
- `sep`: Delimiter separating values (default is `,`).
- `header`: Row containing column names (default is `0`, the first row).

```python
# example
df_travel = pd.read_csv("travel_destinations.csv")
```
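
To try the `sep` and `header` parameters without a file on disk, the sketch below feeds `pd.read_csv()` an in-memory text buffer via `io.StringIO`. The CSV contents and variable names are made up for illustration, and `names` is an additional `read_csv` parameter for supplying column names when the data has no header row.

```python
import io

import pandas as pd

# hypothetical semicolon-separated data with a header row
csv_text = "Destination;Country;Rating\nParis;France;4.8\nTokyo;Japan;4.7\n"
df_semicolon = pd.read_csv(io.StringIO(csv_text), sep=";", header=0)
print(df_semicolon)

# hypothetical comma-separated data without a header row:
# header=None stops pandas from treating the first row as column names,
# and names= supplies the column names instead
csv_no_header = "Paris,France,4.8\nTokyo,Japan,4.7\n"
df_no_header = pd.read_csv(
    io.StringIO(csv_no_header),
    header=None,
    names=["Destination", "Country", "Rating"],
)
print(df_no_header)
```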

### Creating DataFrames from Excel Files

Use the `pd.read_excel()` function to load data from Excel spreadsheets into a **DataFrame**.

```python
# basic syntax
pd.read_excel(filepath, sheet_name=0, header=0)
```

- `filepath`: Path to the Excel file.
- `sheet_name`: Name or index of sheet to load (default is first sheet).
- `header`: Row containing column names (default is first row).

```python
# example
df_travel = pd.read_excel("travel_destinations.xlsx", sheet_name="Sheet1")
```

### Using Preloaded Datasets from Other Libraries

Many Python libraries come with preloaded datasets for practice and experimentation. Some can be loaded directly into `Pandas` **DataFrames** with the functions the library provides. Otherwise, you can pass the data to the `pd.DataFrame()` constructor to turn it into a **DataFrame**.

In [15]:
# preloaded dataset from seaborn library
import seaborn as sns
df_tips = sns.load_dataset("tips")

# head() displays first 5 rows of DataFrame by default or first x rows when specified
df_tips.head()

   total_bill   tip     sex smoker  day    time  size
0       16.99  1.01  Female     No  Sun  Dinner     2
1       10.34  1.66    Male     No  Sun  Dinner     3
2       21.01  3.50    Male     No  Sun  Dinner     3
3       23.68  3.31    Male     No  Sun  Dinner     2
4       24.59  3.61  Female     No  Sun  Dinner     4


In [16]:
# preloaded dataset from scikit-learn library
from sklearn.datasets import load_iris
import pandas as pd

iris = load_iris()
df_iris = pd.DataFrame(iris.data, columns=iris.feature_names)

# tail() displays last 5 rows of DataFrame by default or last x rows when specified
print(df_iris.tail(8))

     sepal length (cm)  sepal width (cm)  petal length (cm)  petal width (cm)
142                5.8               2.7                5.1               1.9
143                6.8               3.2                5.9               2.3
144                6.7               3.3                5.7               2.5
145                6.7               3.0                5.2               2.3
146                6.3               2.5                5.0               1.9
147                6.5               3.0                5.2               2.0
148                6.2               3.4                5.4               2.3
149                5.9               3.0                5.1               1.8


## DataFrame Exploration and Summary Statistics

### Viewing the first few rows

The `.head()` method displays the first 5 rows of a **DataFrame** by default, or the first `n` rows when a number is specified.

In [19]:
# creating a dictionary with travel details
dict_travel_info = {
    "Destination": ["Paris", "Tokyo", "New York", "London", "Cape Town"],
    "Country": ["France", "Japan", "USA", "UK", "South Africa"],
    "Rating": [4.8, 4.7, 4.5, 4.6, 4.9]
}
print(f"The dictionary:\n{dict_travel_info}\n")

# converting a dictionary to a DataFrame
df_travel_info = pd.DataFrame(dict_travel_info)
print(f"The DataFrame:\n{df_travel_info}\n")

# display the first 3 rows only
print(f"The first 3 rows only:\n{df_travel_info.head(3)}")

The dictionary:
{'Destination': ['Paris', 'Tokyo', 'New York', 'London', 'Cape Town'], 'Country': ['France', 'Japan', 'USA', 'UK', 'South Africa'], 'Rating': [4.8, 4.7, 4.5, 4.6, 4.9]}

The DataFrame:
  Destination       Country  Rating
0       Paris        France     4.8
1       Tokyo         Japan     4.7
2    New York           USA     4.5
3      London            UK     4.6
4   Cape Town  South Africa     4.9

The first 3 rows only:
  Destination Country  Rating
0       Paris  France     4.8
1       Tokyo   Japan     4.7
2    New York     USA     4.5


### Viewing the last few rows

The `.tail()` method displays the last 5 rows of a **DataFrame** by default, or the last `n` rows when a number is specified.

In [20]:
# creating a dictionary with travel details
dict_travel_info = {
    "Destination": ["Paris", "Tokyo", "New York", "London", "Cape Town"],
    "Country": ["France", "Japan", "USA", "UK", "South Africa"],
    "Rating": [4.8, 4.7, 4.5, 4.6, 4.9]
}
print(f"The dictionary:\n{dict_travel_info}\n")

# converting a dictionary to a DataFrame
df_travel_info = pd.DataFrame(dict_travel_info)
print(f"The DataFrame:\n{df_travel_info}\n")

# display the last 3 rows only
print(f"The last 3 rows only:\n{df_travel_info.tail(3)}")

The dictionary:
{'Destination': ['Paris', 'Tokyo', 'New York', 'London', 'Cape Town'], 'Country': ['France', 'Japan', 'USA', 'UK', 'South Africa'], 'Rating': [4.8, 4.7, 4.5, 4.6, 4.9]}

The DataFrame:
  Destination       Country  Rating
0       Paris        France     4.8
1       Tokyo         Japan     4.7
2    New York           USA     4.5
3      London            UK     4.6
4   Cape Town  South Africa     4.9

The last 3 rows only:
  Destination       Country  Rating
2    New York           USA     4.5
3      London            UK     4.6
4   Cape Town  South Africa     4.9
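
Two edge cases of `.head()` and `.tail()` are worth knowing: asking for more rows than exist, and passing a negative number. The sketch below uses a reduced version of the travel data; the variable names are illustrative.

```python
import pandas as pd

df_travel_info = pd.DataFrame({
    "Destination": ["Paris", "Tokyo", "New York", "London", "Cape Town"],
    "Rating": [4.8, 4.7, 4.5, 4.6, 4.9],
})

# asking for more rows than exist simply returns the whole DataFrame
all_rows = df_travel_info.head(10)
print(len(all_rows))

# a negative n excludes rows instead of selecting them
all_but_last_two = df_travel_info.head(-2)   # everything except the last 2 rows
all_but_first_two = df_travel_info.tail(-2)  # everything except the first 2 rows
print(all_but_last_two)
print(all_but_first_two)
```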


### Get Metadata About the DataFrame

The `.info()` method displays all the information about the **DataFrame**'s structure in one view, including:

- Number of rows and columns
- Column names and their data types
- Number of non-null (non-missing) values in each column
- Memory usage of DataFrame

In [21]:
# creating a dictionary with travel details
dict_travel_info = {
    "Destination": ["Paris", "Tokyo", "New York", "London", "Cape Town"],
    "Country": ["France", "Japan", "USA", "UK", "South Africa"],
    "Rating": [4.8, 4.7, 4.5, 4.6, 4.9]
}
print(f"The dictionary:\n{dict_travel_info}\n")

# converting a dictionary to a DataFrame
df_travel_info = pd.DataFrame(dict_travel_info)
print(f"The DataFrame:\n{df_travel_info}\n")

# display DataFrame metadata
# .info() prints its summary directly and returns None, so call it outside print()
print("The metadata is:")
df_travel_info.info()

The dictionary:
{'Destination': ['Paris', 'Tokyo', 'New York', 'London', 'Cape Town'], 'Country': ['France', 'Japan', 'USA', 'UK', 'South Africa'], 'Rating': [4.8, 4.7, 4.5, 4.6, 4.9]}

The DataFrame:
  Destination       Country  Rating
0       Paris        France     4.8
1       Tokyo         Japan     4.7
2    New York           USA     4.5
3      London            UK     4.6
4   Cape Town  South Africa     4.9

The metadata is:
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 5 entries, 0 to 4
Data columns (total 3 columns):
 #   Column       Non-Null Count  Dtype  
---  ------       --------------  -----  
 0   Destination  5 non-null      object 
 1   Country      5 non-null      object 
 2   Rating       5 non-null      float64
dtypes: float64(1), object(2)
memory usage: 252.0+ bytes
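
Note that `.info()` writes its summary straight to the screen and returns `None`, so its result cannot be embedded in an f-string. If the summary is needed as text, one option is the method's `buf` parameter, which redirects the output to a writable buffer. A minimal sketch, with `buffer` and `info_text` as illustrative names:

```python
import io

import pandas as pd

df_travel_info = pd.DataFrame({
    "Destination": ["Paris", "Tokyo", "New York", "London", "Cape Town"],
    "Country": ["France", "Japan", "USA", "UK", "South Africa"],
    "Rating": [4.8, 4.7, 4.5, 4.6, 4.9],
})

# info() prints the summary itself; the return value is None
result = df_travel_info.info()
print(result is None)

# redirect the summary into an in-memory buffer to get it as a string
buffer = io.StringIO()
df_travel_info.info(buf=buffer)
info_text = buffer.getvalue()
print(f"The metadata is:\n{info_text}")
```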


### Counting the Number of Unique Values

The `.nunique()` method returns the _number of distinct values_ in a **Series** or a column of a **DataFrame**.

```python
# basic syntax
series.nunique(dropna=True)
```

- `dropna`: If `True`, excludes `NaN` values from the count (default is `True`).

In [26]:
dict_parks = {
    "Park": ["Yellowstone", "Banff", "Kruger",
             "Yellowstone", "Serengeti", "Banff"],
    "Country": ["USA", "Canada", "South Africa",
                "USA", "Tanzania", "Canada"]
}
df_parks = pd.DataFrame(dict_parks)

unique_park_count = df_parks["Park"].nunique()
print(unique_park_count)

4
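
The `dropna` parameter only matters when the data contains missing values. The sketch below uses a hypothetical Series in which one park name is missing (`None`) to show the difference.

```python
import pandas as pd

# hypothetical Series with one missing value
parks = pd.Series(["Yellowstone", "Banff", None, "Yellowstone"])

# default: missing values are excluded from the count
count_without_missing = parks.nunique()

# dropna=False: the missing value counts as one more distinct value
count_with_missing = parks.nunique(dropna=False)

print(count_without_missing, count_with_missing)
```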


### Counting Unique Value Frequency

The `.value_counts()` method returns the _frequency count_ of unique values in a **Series** or a column of a **DataFrame**.

```python
# basic syntax
series.value_counts(normalize=False, sort=True)
```

- `normalize`: If `True`, returns the proportion of each value instead of raw counts.
- `sort`: If `True`, sorts counts in descending order (default behavior).

In [24]:
dict_parks = {
    "Park": ["Yellowstone", "Banff", "Kruger",
             "Yellowstone", "Serengeti", "Banff"],
    "Country": ["USA", "Canada", "South Africa",
                "USA", "Tanzania", "Canada"]
}
df_parks = pd.DataFrame(dict_parks)

park_counts = df_parks["Park"].value_counts()
print(park_counts)

Park
Yellowstone    2
Banff          2
Kruger         1
Serengeti      1
Name: count, dtype: int64
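
To see shares instead of raw counts, set `normalize=True`. This sketch reuses the park names from the example above; `park_shares` is an illustrative name.

```python
import pandas as pd

df_parks = pd.DataFrame({
    "Park": ["Yellowstone", "Banff", "Kruger",
             "Yellowstone", "Serengeti", "Banff"],
})

# normalize=True divides each count by the total number of rows,
# so the values are proportions that add up to 1
park_shares = df_parks["Park"].value_counts(normalize=True)
print(park_shares)
```

With 6 rows in total, a park that appears twice has a share of 2/6 and a park that appears once has a share of 1/6.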


### Viewing Unique Values

The `.unique()` method returns an array of all unique values in a **Series** or a column of a **DataFrame**.

In [25]:
dict_parks = {
    "Park": ["Yellowstone", "Banff", "Kruger",
             "Yellowstone", "Serengeti", "Banff"],
    "Country": ["USA", "Canada", "South Africa",
                "USA", "Tanzania", "Canada"]
}
df_parks = pd.DataFrame(dict_parks)

unique_countries = df_parks["Country"].unique()
print(unique_countries)

['USA' 'Canada' 'South Africa' 'Tanzania']
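
The three methods are closely related: `.unique()` lists the distinct values in order of first appearance, and when there are no missing values its length matches the result of `.nunique()`. A short sketch using the same country data:

```python
import pandas as pd

df_parks = pd.DataFrame({
    "Country": ["USA", "Canada", "South Africa",
                "USA", "Tanzania", "Canada"],
})

# unique() keeps the order in which values first appear (it does not sort)
unique_countries = list(df_parks["Country"].unique())
print(unique_countries)

# with no missing values, the number of unique values equals nunique()
print(len(unique_countries) == df_parks["Country"].nunique())
```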


## Exercise 1

Create a `Pandas` **DataFrame** containing information about the **five highest mountains in the world**. Follow these steps:

- Use the data dictionary provided, which contains the following columns:
  - `Mountain`: The name of the mountain.
  - `Height_m`: The height of the mountain in meters.
  - `Location`: The location (countries or regions) of the mountain.
  - `First_Ascent`: The year of the first recorded ascent.
- Create a `Pandas` **DataFrame**.
- Print the following pandas attributes:
  - `dtypes`: Check the data types of the columns.
  - `columns`: Display the column names in the DataFrame.
  - `shape`: Print the number of rows and columns in the DataFrame.

  ```python
  data = {
    'Mountain': ['Mount Everest', 'K2', 'Kangchenjunga', 'Lhotse',
                 'Makalu'],
    'Height_m': [8848, 8611, 8586, 8516, 8485],
    'Location': ['Nepal/China', 'Pakistan/China', 'Nepal/India',
                 'Nepal/China', 'Nepal/China'],
    'First_Ascent': [1953, 1954, 1955, 1956, 1955],
    }
  ```

In [14]:
data = {
    'Mountain': ['Mount Everest', 'K2', 'Kangchenjunga', 'Lhotse',
                 'Makalu'],
    'Height_m': [8848, 8611, 8586, 8516, 8485],
    'Location': ['Nepal/China', 'Pakistan/China', 'Nepal/India',
                 'Nepal/China', 'Nepal/China'],
    'First_Ascent': [1953, 1954, 1955, 1956, 1955],
    }

# creating a DataFrame
df_mountains = pd.DataFrame(data)
print(f"The DataFrame:\n{df_mountains}\n")

# exploring the DataFrame
# view column data types
print(f"Column data types:\n{df_mountains.dtypes}\n")

# view column names
print(f"Column names:\n{df_mountains.columns}\n")

# view number of rows and columns
print(f"Number of rows and columns:\n{df_mountains.shape}")

The DataFrame:
        Mountain  Height_m        Location  First_Ascent
0  Mount Everest      8848     Nepal/China          1953
1             K2      8611  Pakistan/China          1954
2  Kangchenjunga      8586     Nepal/India          1955
3         Lhotse      8516     Nepal/China          1956
4         Makalu      8485     Nepal/China          1955

Column data types:
Mountain        object
Height_m         int64
Location        object
First_Ascent     int64
dtype: object

Column names:
Index(['Mountain', 'Height_m', 'Location', 'First_Ascent'], dtype='object')

Number of rows and columns:
(5, 4)


## Exercise 2

Let’s now analyze a `Pandas` **DataFrame** containing information about the **five highest mountains in Europe**.

- Use `.head(2)` to check the first two rows of the DataFrame.
- Use `.tail(2)` to check the last two rows of the DataFrame.
- Use `.info()` to display a summary of the DataFrame, including column names, non-null counts, and data types.

```python
data = {
    'Mountain': ['Mount Elbrus', 'Dykh-Tau', 'Shkhara', 'Koshtan-Tau',
                 'Pushkin Peak'],
    'Height_m': [5642, 5205, 5193, 5151, 5100],
    'Location': ['Russia', 'Russia', 'Russia/Georgia', 'Russia',
                 'Russia'],
    'First_Ascent': [1874, 1888, 1888, 1889, 1891],
    }
```

In [23]:
# creating a dictionary with mountain information
dict_eu_mountains = {
    'Mountain': ['Mount Elbrus', 'Dykh-Tau', 'Shkhara', 'Koshtan-Tau',
                 'Pushkin Peak'],
    'Height_m': [5642, 5205, 5193, 5151, 5100],
    'Location': ['Russia', 'Russia', 'Russia/Georgia', 'Russia',
                 'Russia'],
    'First_Ascent': [1874, 1888, 1888, 1889, 1891],
    }

# converting a dictionary to a DataFrame
df_eu_mountains = pd.DataFrame(dict_eu_mountains)
print(f"The DataFrame:\n{df_eu_mountains}\n")

# display the first 2 rows only
print(f"The first 2 rows only:\n{df_eu_mountains.head(2)}\n")

# display the last 2 rows only
print(f"The last 2 rows only:\n{df_eu_mountains.tail(2)}\n")

# display DataFrame metadata
# .info() prints its summary directly and returns None, so call it outside print()
print("The metadata is:")
df_eu_mountains.info()

The DataFrame:
       Mountain  Height_m        Location  First_Ascent
0  Mount Elbrus      5642          Russia          1874
1      Dykh-Tau      5205          Russia          1888
2       Shkhara      5193  Russia/Georgia          1888
3   Koshtan-Tau      5151          Russia          1889
4  Pushkin Peak      5100          Russia          1891

The first 2 rows only:
       Mountain  Height_m Location  First_Ascent
0  Mount Elbrus      5642   Russia          1874
1      Dykh-Tau      5205   Russia          1888

The last 2 rows only:
       Mountain  Height_m Location  First_Ascent
3   Koshtan-Tau      5151   Russia          1889
4  Pushkin Peak      5100   Russia          1891

The metadata is:
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 5 entries, 0 to 4
Data columns (total 4 columns):
 #   Column        Non-Null Count  Dtype 
---  ------        --------------  ----- 
 0   Mountain      5 non-null      object
 1   Height_m      5 non-null      int64 
 2   Location      5 non-null      object

## Exercise 3

Let’s check the **five highest mountains in Africa** now. However, the dataset includes duplicate entries for some mountains, resulting in 8 rows instead of 5. Your job is to analyze the dataset and complete the following tasks:

- Create a pandas DataFrame using the provided data.
- Use `.value_counts()` to count how many times each mountain appears in the dataset.
- Use `.nunique()` to find the number of unique mountain names in the dataset.
- Use `.unique()` to list the names of all unique mountains.

```python
data = {
    "Mountain": [
        "Mount Kilimanjaro", "Mount Kenya", "Mount Stanley",
        "Mount Kenya", "Mount Speke", "Mount Baker",
        "Mount Kilimanjaro", "Mount Stanley"
    ],
    "Height_m": [
        5895, 5199, 5109,
        5199, 4890, 4843,
        5895, 5109
    ],
    "Location": [
        "Tanzania", "Kenya", "Uganda/Congo",
        "Kenya", "Uganda", "Uganda",
        "Tanzania", "Uganda/Congo"
    ]
}
```

In [28]:
# creating a dictionary with mountain information
dict_africa_mountains = {
    "Mountain": [
        "Mount Kilimanjaro", "Mount Kenya", "Mount Stanley",
        "Mount Kenya", "Mount Speke", "Mount Baker",
        "Mount Kilimanjaro", "Mount Stanley"
    ],
    "Height_m": [
        5895, 5199, 5109,
        5199, 4890, 4843,
        5895, 5109
    ],
    "Location": [
        "Tanzania", "Kenya", "Uganda/Congo",
        "Kenya", "Uganda", "Uganda",
        "Tanzania", "Uganda/Congo"
    ]
}

# converting a dictionary to a DataFrame
df_africa_mountains = pd.DataFrame(dict_africa_mountains)
print(f"The DataFrame:\n{df_africa_mountains}\n")

# count how many times each mountain appears
mountain_counts = df_africa_mountains["Mountain"].value_counts()
print(f"The number of duplicates is:\n{mountain_counts}\n")

# count number of unique mountains
unique_mountain_count = df_africa_mountains["Mountain"].nunique()
print(f"The number of unique mountains is:\n{unique_mountain_count}\n")

# list all unique mountains
unique_mountains = df_africa_mountains["Mountain"].unique()
print(f"The unique mountains are:\n{unique_mountains}")

The DataFrame:
            Mountain  Height_m      Location
0  Mount Kilimanjaro      5895      Tanzania
1        Mount Kenya      5199         Kenya
2      Mount Stanley      5109  Uganda/Congo
3        Mount Kenya      5199         Kenya
4        Mount Speke      4890        Uganda
5        Mount Baker      4843        Uganda
6  Mount Kilimanjaro      5895      Tanzania
7      Mount Stanley      5109  Uganda/Congo

The number of duplicates is:
Mountain
Mount Kilimanjaro    2
Mount Kenya          2
Mount Stanley        2
Mount Speke          1
Mount Baker          1
Name: count, dtype: int64

The number of unique mountains is:
5

The unique mountains are:
['Mount Kilimanjaro' 'Mount Kenya' 'Mount Stanley' 'Mount Speke'
 'Mount Baker']
