## Week 10 Lecture `.ipynb` File

#### Author: Mahmoud Harding

In [None]:
## Import the pandas library and give it an 
## alias pd
import pandas as pd

## Read the CSV file skyscrapers.csv from the data directory 
## and store the data in a DataFrame named skyscrapers
skyscrapers = pd.read_csv('data/skyscrapers.csv')

**Example 1.** Display the information for the `skyscrapers` dataframe.

In [None]:
## Display a concise summary of the skyscrapers DataFrame, 
## including the index, column names, non-null counts, 
## and data types for each column
skyscrapers.info()

**Example 2.** Display the first 5 rows of the `skyscrapers` dataframe.

In [None]:
skyscrapers.head()

**Example 3.** Examine the column names.

In [None]:
## Return the column labels of the skyscrapers DataFrame (a pandas Index object)
skyscrapers.columns

**Example 4.** Rename columns in the `skyscrapers` `DataFrame`.

In [None]:
## Rename specific columns in the skyscrapers DataFrame:
## location.city is renamed to city,
## statistics.height is renamed to height,
## statistics.floors above is renamed to floors,
## status.completed.year is renamed to year_completed,
## status.started.year is renamed to year_started.
skyscrapers.rename(columns={'location.city': 'city',
                            'statistics.height': 'height',
                            'statistics.floors above': 'floors',
                            'status.completed.year': 'year_completed', 
                            'status.started.year': 'year_started'},
                   
                   ## The inplace=True argument ensures the changes are applied directly 
                   ## to the original DataFrame.
                   inplace=True)

In [None]:
skyscrapers.info()
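
Example 4 uses `inplace=True`. As a minimal sketch of the alternative, `rename()` without `inplace` returns a new `DataFrame` and leaves the original untouched. The `toy` DataFrame and its values below are hypothetical and are not taken from `skyscrapers.csv`.

```python
import pandas as pd

## Hypothetical toy data with dotted column names (not the real dataset)
toy = pd.DataFrame({'location.city': ['City A', 'City B'],
                    'statistics.height': [100.0, 200.0]})

## Without inplace=True, rename() returns a renamed copy
renamed = toy.rename(columns={'location.city': 'city',
                              'statistics.height': 'height'})

print(list(toy.columns))      ## the original column names are unchanged
print(list(renamed.columns))  ## the new names appear on the returned copy
```

Assigning the result back to the same name, as in `toy = toy.rename(...)`, has the same effect as `inplace=True`.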

**Example 5.** How many skyscrapers are there in each country?

In [None]:
skyscrapers.country.value_counts()

## Renaming Specific Values 

This code

```
skyscrapers.country.value_counts()
```

executed without any errors, but the output is incorrect because the same country has been entered in different ways, leading to inconsistent results. For instance:

- USA, US, and United States of America are listed separately, though they refer to the same country.

- United Arab Emirates appears multiple times as United Arab Emirates (UAE) and UAE.

- Malaysia is misspelled as Malasya.

- Saudi Arabia is listed as saudi Arabia (with a lowercase "s").

In [None]:
mask = (
    (skyscrapers['country'] == "US") |
    (skyscrapers['country'] == "USA") |
    (skyscrapers['country'] == "United States of America")
)

skyscrapers[mask]

**Example 6.** Standardize the values `US`, `USA`, and `United States of America` to `United States of America`.

**Step 1.** Find the row index labels.

In [None]:
## .loc[12, ] selects the row with the index label 12 from the 'skyscrapers' 
## DataFrame. The empty space after the comma means all columns are selected 
## for that row.
skyscrapers.loc[12, ]

**Step 2.** Access the `country` value at index label 12.

In [None]:
## .loc[12, 'country'] selects the row with index label 12 and the
## 'country' column, returning the single value stored in that cell.
skyscrapers.loc[12, 'country']

**Step 3.** Reassign the `country` value at index label 12 to `United States of America`.

In [None]:
## This assigns the value 'United States of America' to the 'country' 
## column for the row with index label 12.
## It updates the 'country' value specifically for that row.
skyscrapers.loc[12, 'country'] = 'United States of America'

## This retrieves all the columns for the row with index label 12 
## from the 'skyscrapers' DataFrame. It returns the entire row after
## the update.
skyscrapers.loc[12, ]

**Step 4.** Use the `.index` attribute and a `for` loop to print the incorrect values.

In [None]:
## This loop iterates over the index values of the rows in the 
## 'skyscrapers' DataFrame that satisfy the condition defined by 'mask'. 
## 'skyscrapers[mask]' filters the rows, and '.index' gets the index 
## labels of those rows.

## For each index 'i' in the filtered DataFrame, this prints the value 
## from the 'country' column for that specific row using the .loc[] method 
## to access the row by its index.
for i in skyscrapers[mask].index:
    print(skyscrapers.loc[i, 'country'])

**Step 5.** Use the `.index` attribute and a `for` loop to correct the incorrect values.

In [None]:
for i in skyscrapers[mask].index:
    skyscrapers.loc[i, 'country'] = 'United States of America'

**Step 6.** Check the reassigned values.

In [None]:
for i in skyscrapers[mask].index:
    print(skyscrapers.loc[i, 'country'])
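
The loop in Steps 4 to 6 updates one row at a time. As a sketch of an alternative, `.loc` also accepts the boolean mask directly, so all matching rows can be updated in one assignment. The `toy` DataFrame below is hypothetical, and `isin()` is swapped in here as a shorter way to build the mask than chaining `==` with `|`.

```python
import pandas as pd

## Hypothetical toy data that mimics the inconsistent labels
toy = pd.DataFrame({'country': ['US', 'China', 'USA',
                                'United States of America']})

## isin() returns True for rows whose value appears in the list
mask = toy['country'].isin(['US', 'USA', 'United States of America'])

## .loc with a boolean mask assigns to every matching row at once
toy.loc[mask, 'country'] = 'United States of America'

print(toy['country'].value_counts())
```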

Now we have the correct counts for the United States of America.

In [None]:
skyscrapers.country.value_counts()

The country names for the remaining errors can be corrected using the same technique.
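
As a sketch of a different technique from the loop used above, `Series.replace()` with a dictionary can correct several labels in one step. The `toy` DataFrame is hypothetical; the misspelled labels are the ones listed earlier in this section, and the `corrections` name is our own.

```python
import pandas as pd

## Hypothetical toy data using the inconsistent labels described above
toy = pd.DataFrame({'country': ['UAE', 'United Arab Emirates (UAE)',
                                'Malasya', 'saudi Arabia', 'China']})

## Map each incorrect label to its corrected form
corrections = {'UAE': 'United Arab Emirates',
               'United Arab Emirates (UAE)': 'United Arab Emirates',
               'Malasya': 'Malaysia',
               'saudi Arabia': 'Saudi Arabia'}

## Values that are not keys in the dictionary (such as 'China') are left as-is
toy['country'] = toy['country'].replace(corrections)

print(toy['country'].value_counts())
```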

## Visualization

Another exploration technique is visualization, where we can apply the same data moves concepts. We'll explore common visualization methods using the `skyscrapers` dataframe, leveraging plotting features from both `pandas` and `matplotlib`. 

Run the cell below to import the necessary libraries for creating our visualizations.

**Note:** To learn more about visualizations in `matplotlib` click [here](https://matplotlib.org/) and for documentation on creating visualizations using `pandas` click [here](https://pandas.pydata.org/pandas-docs/stable/user_guide/visualization.html).

### Bar Chart

A bar chart is a graphical representation of categorical data where each category is represented by a bar, with the height or length of the bar corresponding to its value. It is used to compare the frequency, count, or other metrics across different categories. Bar charts are ideal when you need to visually compare discrete categories or show trends over time. 

This can be done using `pandas`.

```python
df.plot.bar()
```

or using `matplotlib`

```python
plt.bar()
```
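
The two calls above are shown without arguments. As a minimal sketch with hypothetical counts (not taken from `skyscrapers.csv`), the `pandas` version reads the bar labels from the index, while the `matplotlib` version takes the labels and heights as explicit arguments.

```python
import matplotlib.pyplot as plt
import pandas as pd

## Hypothetical category counts
counts = pd.Series({'A': 5, 'B': 3, 'C': 2})

## pandas: the Series index supplies the bar labels
ax1 = counts.plot.bar(rot=0, title="pandas")

## matplotlib: create a new figure, then pass labels and heights explicitly
fig, ax2 = plt.subplots()
bars = ax2.bar(list(counts.index), counts.values)
ax2.set_title("matplotlib");
```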

Let's import the `matplotlib.pyplot` library.

In [None]:
import matplotlib.pyplot as plt

Now we can make a bar chart for the counts of the countries.

In [None]:
## The 'value_counts()' function counts the occurrences of each unique value
## in the 'country' column. It returns a Series where the index is the unique values
## (in this case, countries), and the values are the frequency counts.
## The result is stored in the variable 'tbl', representing the number of skyscrapers 
## per country.
tbl = skyscrapers['country'].value_counts()

## A slice of the first 3 rows of the frequency table 'tbl',
## showing the top 3 countries with the highest number of skyscrapers.
tbl[:3]

In [None]:
## A bar chart for the top 3 most frequent countries from the 'tbl' Series.
## 'plot.bar()' creates a bar plot with the country names on the x-axis and 
## the frequency (number of skyscrapers) on the y-axis.
tbl[:3].plot.bar()

## Add a title to the bar plot with the text "Skyscraper Frequency".
plt.title("Skyscraper Frequency");

In [None]:
## Creates a bar chart for the top 3 countries in 'tbl', with horizontal
## x-axis labels (rot=0). The result is assigned to 'ax', which allows
## the plot to be further customized.
ax = tbl[:3].plot.bar(rot=0)

## Change the bar labels: this manually sets the x-axis labels to 'China',
## 'USA', and 'UAE' instead of the default labels from the data.
ax.set_xticklabels(['China', 'USA', 'UAE'])

## Adds a title "Skyscraper Frequency" to the bar plot.
plt.title("Skyscraper Frequency");

### Histogram

A histogram is a graphical representation of the distribution of numerical data. Unlike a bar chart, which displays categorical data, a histogram groups continuous data into bins (ranges) and shows the frequency or count of data points within each bin. The height of each bar represents the number of data points that fall within each bin.
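
To make the binning concrete, here is a small sketch that counts values per bin with `numpy.histogram`. The floor counts are hypothetical, and `bins=3` is chosen only to keep the output short; `.hist()` in `pandas` uses 10 bins by default.

```python
import numpy as np

## Hypothetical floor counts
floors = [10, 20, 25, 30, 35, 40, 45, 50, 80, 100]

## Split the range from 10 to 100 into 3 equal-width bins and count values in each
counts, edges = np.histogram(floors, bins=3)

print(edges)   ## bin boundaries: 10, 40, 70, 100
print(counts)  ## values per bin: 5, 3, 2
```

Each bin includes its left edge and excludes its right edge, except the last bin, which includes both.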

Let's visualize the number of floors.

In [None]:
## Create a histogram for the 'floors' column in the 'skyscrapers' DataFrame.
## The histogram displays the distribution of the number of floors across 
## all skyscrapers, showing how frequently different ranges of floor counts appear.
skyscrapers['floors'].hist()

In [None]:
## The 'edgecolor="white"' argument adds white borders around each bar in 
## the histogram.
skyscrapers['floors'].hist(edgecolor="white");

In [None]:
ax = skyscrapers['floors'].hist(edgecolor="white")

## Disables the grid lines on the histogram, removing the default grid.
ax.grid(False)

## Set the plot title to "Distribution of the Number of Floors"
## Label the x-axis as "Number of floors"
## Label the y-axis as "Count"
ax.set_title("Distribution of the Number of Floors")
ax.set_xlabel("Number of floors")
ax.set_ylabel("Count");

### Box Plot

A boxplot is a way of displaying the distribution of data based on a five-number summary. 

- Median (Q2): The line inside the box represents the median (the middle value of the data set).

- First quartile (Q1): The lower edge of the box, representing the 25th percentile (where 25% of the data lies below this value).

- Third quartile (Q3): The upper edge of the box, representing the 75th percentile (where 75% of the data lies below this value).

- Interquartile Range (IQR): The range between the first quartile (Q1) and third quartile (Q3), which contains the middle 50% of the data.

- Whiskers: These extend from the edges of the box to the smallest and largest values within 1.5 times the IQR from Q1 and Q3, respectively. They represent the range of most of the data.

- Outliers: Data points outside the whiskers are considered outliers and are usually plotted as individual dots.


A boxplot helps to show the range, spread, and skewness of the data, as well as any potential outliers.
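
The quantities above can be computed directly. The following is a minimal sketch on hypothetical floor counts; the variable names (`lower_fence`, `upper_fence`, and so on) are our own.

```python
import pandas as pd

## Hypothetical floor counts with one unusually tall building
floors = pd.Series([20, 30, 40, 50, 60, 70, 80, 90, 163])

q1 = floors.quantile(0.25)      ## first quartile: 40.0
median = floors.quantile(0.50)  ## median: 60.0
q3 = floors.quantile(0.75)      ## third quartile: 80.0
iqr = q3 - q1                   ## interquartile range: 40.0

## Fences sit 1.5 * IQR beyond the quartiles
lower_fence = q1 - 1.5 * iqr    ## -20.0
upper_fence = q3 + 1.5 * iqr    ## 140.0

## Whiskers end at the most extreme data values inside the fences
whisker_low = floors[floors >= lower_fence].min()   ## 20
whisker_high = floors[floors <= upper_fence].max()  ## 90

## Values outside the fences are drawn as outliers
outliers = floors[(floors < lower_fence) | (floors > upper_fence)]
print(outliers.tolist())        ## [163]
```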

In [None]:
## This creates a boxplot for the 'floors' column, separated by the 'country' column.

## 'column='floors'' specifies that the boxplot is for the 'floors' 
## column (numerical data).

## 'by='country'' groups the data by the 'country' column (categorical data) 
## and creates separate boxplots for each unique value in the 'country' column.
skyscrapers.boxplot(column='floors', by='country');

**Example 7.** Is there anything wrong with this visualization? Turn to your neighbor and discuss your thoughts.

Let’s focus on visualizing only the top three countries.

In [None]:
## A query to filter rows where the country is either
## 'United States of America', 'China', or 'United Arab Emirates'
q = 'country == "United States of America" or ' + \
    'country == "China" or ' + \
    'country == "United Arab Emirates"'

## Apply the query to the 'skyscrapers' DataFrame to get the filtered data
df = skyscrapers.query(q)

## Creating a boxplot for the 'floors' column, grouped by the 'country' column
df.boxplot(column='floors', by='country');

In [None]:
ax = df.boxplot(column='floors', by='country', grid=False)

# Add a title and customize axis labels
ax.set_title("Floors Distribution by Country")
ax.set_xlabel("")
ax.set_ylabel("Number of Floors")

# Remove the default title generated by pandas
plt.suptitle("");

The box plot helps visualize the distribution of `floors` for each country.

**Note:** The box plot may be misleading due to missing observations from the UAE caused by data entry errors.
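
To see why, here is a small sketch with hypothetical rows that use the inconsistent UAE labels described earlier. An exact-match query keeps only the rows spelled `United Arab Emirates`, so the other variants drop out of the plot.

```python
import pandas as pd

## Hypothetical rows; the floor counts are made up for illustration
toy = pd.DataFrame({
    'country': ['United Arab Emirates', 'UAE',
                'United Arab Emirates (UAE)', 'China'],
    'floors': [100, 80, 60, 90],
})

## Only the exact spelling matches; 'UAE' and 'United Arab Emirates (UAE)' are excluded
kept = toy.query('country == "United Arab Emirates"')

print(len(kept))  ## 1 of the 3 UAE rows survives the filter
```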

### Scatter Plot

A scatterplot is a graphical representation used to display the relationship between two continuous variables. Each point on the plot represents an observation, with the position of the point determined by the values of the two variables. The $x$-axis represents one variable, and the $y$-axis represents the other.

Let's visualize the relationship between the year completed and the number of floors.

In [None]:
## Create a scatter plot with the 'year_completed' column on the x-axis and 
## the 'floors' column on the y-axis.
skyscrapers.plot.scatter(x='year_completed', y='floors');

**Example 8.** How would you describe the relationship between time and the number of floors? Is it positive, negative, or neutral?

Describe your answer within the context of the data represented in the visualization.

In [None]:
ax = skyscrapers.plot.scatter(x='year_completed', y='floors', title='Year Completed vs. Floors')

# Add vertical lines at x=2003 and x=2010
plt.axvline(x=2003, color='red')
plt.axvline(x=2010, color='red');

**Example 9.** What do the red lines represent in the scatter plot?

If we want to emphasize trends over time with the year on the $x$-axis, a line chart would be a more effective choice.

In [None]:
## Create a line chart where the 'year_completed' column is on the x-axis
## and the 'floors' column is on the y-axis. 
## The line connects the data points in order.
skyscrapers.plot.line(x='year_completed', y='floors');

**Example 10.** Since there are multiple observations for the same year, we need to group the data by year and choose a statistical measure to calculate from each group.

When grouping data by year (or any other category), you can choose from several common statistical measures depending on what insights you're looking to extract. Here are some options:

- Mean

   ```python
   skyscrapers.groupby('year_completed')['floors'].mean()
   ```
<br>

- Median

   ```python
   skyscrapers.groupby('year_completed')['floors'].median()
   ```
<br>
- Sum

   ```python
   skyscrapers.groupby('year_completed')['floors'].sum()
   ```
<br>
- Count

   ```python
   skyscrapers.groupby('year_completed')['floors'].count()
   ```
<br>
- Min/Max

   ```python
   skyscrapers.groupby('year_completed')['floors'].min()
   
   skyscrapers.groupby('year_completed')['floors'].max()
   ```
<br>
The statistical measure you choose is based on what you're trying to analyze. For example, the mean or median works well for central tendencies, while min/max are effective when you want to highlight extreme cases or understand the range of values in your grouped data.
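
If we want to compare several of these measures side by side, one option is `.agg()` with a list of function names. This is a sketch on hypothetical data, not the real `skyscrapers` DataFrame.

```python
import pandas as pd

## Hypothetical data: two buildings completed in 2008, three in 2010
toy = pd.DataFrame({'year_completed': [2008, 2008, 2010, 2010, 2010],
                    'floors': [50, 70, 60, 90, 163]})

## One row per year, one column per statistic
summary = toy.groupby('year_completed')['floors'].agg(
    ['mean', 'median', 'min', 'max', 'count'])

print(summary)
```

In this toy data the mean for 2010 is pulled up by the 163-floor building, while the median stays at 90, which illustrates why the choice of measure matters.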

To better understand this process, let’s break down the code step by step.

```python
skyscrapers.groupby('year_completed')['floors'].max()
```

- `skyscrapers.groupby('year_completed')` groups the `skyscrapers` `DataFrame` by the values in the `year_completed` column. Each unique value in `year_completed` forms a group.

- After grouping by `year_completed`, `['floors']` selects the `floors` column within each group for further operations.

- Finally, `.max()` computes the maximum value of the `floors` column for each group (each unique `year_completed`). It returns the highest number of floors among the skyscrapers completed in each year as a `Series` indexed by year.
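
The three steps can be traced on a small example. The `toy` DataFrame below is hypothetical, and the intermediate names `grouped` and `floors_by_year` are our own.

```python
import pandas as pd

## Hypothetical data with two completion years
toy = pd.DataFrame({'year_completed': [2008, 2010, 2008, 2010],
                    'floors': [50, 60, 70, 163]})

## Step 1: group the rows by year
grouped = toy.groupby('year_completed')

## Step 2: select the floors column within each group
floors_by_year = grouped['floors']

## Peek inside the groups: each year maps to its floor values
for year, values in floors_by_year:
    print(year, values.tolist())

## Step 3: reduce each group to its maximum
result = floors_by_year.max()
print(result)
```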

In [None]:
skyscrapers.groupby('year_completed')['floors'].max()

In [None]:
df = skyscrapers.groupby('year_completed')['floors'].max()

df.plot.line(title="Max Floors by Year");

In [None]:
## .loc[2000:] uses label-based indexing: it selects all rows of the result
## whose year_completed label is 2000 or greater.
df = skyscrapers.groupby('year_completed')['floors'].max().loc[2000:]

df.plot.line(title="Max Floors by Year");

In [None]:
## [10:] uses positional indexing and selects a slice of the data starting at
## position 10 (the 11th row, since positions start at 0) onward.
## Unlike .loc, which selects rows based on labels, [10:] selects rows based on their
## position in the result, regardless of the actual values in the year_completed column.
df = skyscrapers.groupby('year_completed')['floors'].max()[10:]

## The figsize=(12, 6) parameter in plotting functions (like in matplotlib and pandas plotting) 
## specifies the size of the plot in inches.
## 12 is the width of the plot in inches and 6 is the height of the plot in inches.
df.plot.line(title="Max Floors by Year", figsize=(12, 6));
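
The difference between label-based and positional slicing is easier to see on a short `Series`. This sketch uses hypothetical values and swaps in `.iloc`, the explicit positional indexer, in place of the plain `[10:]` slice used above.

```python
import pandas as pd

## Hypothetical max-floors values indexed by year
s = pd.Series([40, 55, 70, 88, 101],
              index=[1990, 1995, 2000, 2005, 2010], name='floors')

## Label-based: keep rows whose index label is 2000 or greater
by_label = s.loc[2000:]

## Position-based: skip the first 2 rows, whatever their labels are
by_position = s.iloc[2:]

print(by_label.index.tolist())     ## [2000, 2005, 2010]
print(by_position.index.tolist())  ## [2000, 2005, 2010]
```

The two results match here only because the year 2000 happens to sit at position 2. With a different set of years, the same positional slice would start at a different year.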

In [None]:
df = skyscrapers.groupby('year_completed')['floors'].mean()[10:]
df.plot.line(title="Mean Floors by Year", figsize=(12, 6));

In [None]:
df = skyscrapers.groupby('year_completed')['floors'].median()[10:]
df.plot.line(title="Median Floors by Year", figsize=(12, 6));