## Week 8 Lecture `.ipynb` File

#### Author: Mahmoud Harding

## Python Basics

### Built-in Functions

- A function that is already available in a programming language/application that can be accessed by end users.

- Returns some value based on its arguments.

- `print`, `abs`, `max`, `min`, `pow`, `round`, etc.

In [None]:
abs(-3)

In [None]:
abs(2-5)

In [None]:
print(3, 10 ** 2, 100.1)

In [None]:
max(3, 10 ** 2, 100.1)

## Nesting Functions

In [None]:
round(abs(1.6002-1.688), 4)

In [None]:
1.6002-1.688

In [None]:
abs(1.6002-1.688)

In [None]:
round(abs(1.6002-1.688), 4)
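
To make the evaluation order of a nested call explicit, the sketch below breaks the same expression into named steps. The variable names `difference`, `magnitude`, and `rounded` are illustrative only and do not appear elsewhere in the lecture.

```python
## Python evaluates nested calls from the inside out
difference = 1.6002 - 1.688      ## innermost expression: a negative number
magnitude = abs(difference)      ## abs() removes the sign
rounded = round(magnitude, 4)    ## round() keeps 4 decimal places

## The step-by-step result matches the single nested call
print(rounded == round(abs(1.6002 - 1.688), 4))
```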

## Pandas

Pandas is an open source Python package widely used for data analysis, data science, and machine learning tasks. It is built on top of another library, `NumPy`, which provides support for arrays. Since we know how to perform operations on `NumPy` arrays, we can operate on columns in a `pandas` dataframe.

Pandas is a fast, powerful, flexible and (sometimes) easy to use open source data analysis and manipulation tool. Click the `Cheat Sheet` below to access the Data Wrangling with `pandas` [Cheat Sheet](https://pandas.pydata.org/Pandas_Cheat_Sheet.pdf).
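
As a rough illustration of how column operations resemble `NumPy` array operations, here is a minimal sketch. It uses a tiny made-up DataFrame named `towers` (an assumption for this example, not the lecture's dataset), and the column names `height_m` and `height_ft` are hypothetical.

```python
import numpy as np
import pandas as pd

## A tiny, made-up DataFrame (not the lecture's skyscrapers data)
towers = pd.DataFrame({
    "name": ["Tower A", "Tower B", "Tower C"],
    "height_m": [300.0, 450.0, 600.0],
})

## Arithmetic applies to every value in the column at once,
## just as it does for a NumPy array
towers["height_ft"] = towers["height_m"] * 3.28084

## The values of a column can be extracted as a NumPy array
heights = towers["height_m"].to_numpy()
print(type(heights))
print(towers)
```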

In [None]:
## Import the pandas library and give it an 
## alias pd
import pandas as pd

## Read the CSV file skyscrapers.csv from the data directory 
## and store the data in a DataFrame named skyscrapers
skyscrapers = pd.read_csv('data/skyscrapers.csv')

**Example 1.** Display the first 5 rows of the `skyscrapers` dataframe.

In [None]:
skyscrapers.head()

### Common `pandas` `DataFrame` Methods

- `.head()`
- `.info()`
- `.describe()`
- `.sample()`

Apply the following methods to the `skyscrapers` `DataFrame` in the examples below.

**Example 2.** `.info()`

In [None]:
## Display a concise summary of the skyscrapers DataFrame, 
## including the index, column names, non-null counts, 
## and data types for each column
skyscrapers.info()

**Example 3.** `.describe()`

In [None]:
## Generate descriptive statistics for the numerical columns in 
## the skyscrapers DataFrame, including count, mean, standard deviation, 
## minimum, quartiles, and maximum values
skyscrapers.describe()

**Example 4.** `.sample()`

In [None]:
## Return a random sample of rows from the skyscrapers DataFrame.
## By default, it returns one row, but the number of rows can 
## be specified as an argument.
skyscrapers.sample()

In [None]:
## Return a random sample of 5 rows from the skyscrapers DataFrame.
## Passing a number as the argument overrides the default of one row.
skyscrapers.sample(5)
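
Because `.sample()` draws rows at random, the output changes each time the cell runs. If a repeatable sample is wanted, one option is the `random_state` parameter. The sketch below assumes a small made-up DataFrame named `demo`, and the seed value 42 is arbitrary.

```python
import pandas as pd

## A small, made-up DataFrame used only for illustration
demo = pd.DataFrame({"building": ["A", "B", "C", "D", "E", "F"],
                     "floors": [40, 55, 62, 71, 88, 101]})

## The same seed gives the same rows each time the cell is run
first_draw = demo.sample(n=3, random_state=42)
second_draw = demo.sample(n=3, random_state=42)

print(first_draw.equals(second_draw))
```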

### Accessing `pandas` `DataFrame` Attributes

- `.shape`
- `.columns`

**Example 5.** `.shape`

In [None]:
## Return the dimensions of the skyscrapers DataFrame as 
## a tuple (number of rows, number of columns)
skyscrapers.shape

**Example 6.** `.columns`

In [None]:
## Return the column names of the skyscrapers DataFrame as an Index object
skyscrapers.columns
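
Note that `.columns` returns a `pandas` `Index` object, not a plain Python list. A minimal sketch of working with it, using a made-up two-column DataFrame named `demo` (an assumption for this example):

```python
import pandas as pd

## A made-up DataFrame used only for illustration
demo = pd.DataFrame({"name": ["A", "B"], "floors": [40, 55]})

## .columns is an Index object
print(type(demo.columns))

## Convert to a plain list when a list is needed
column_names = demo.columns.tolist()
print(column_names)

## Membership tests work directly on the Index
print("floors" in demo.columns)
```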

### Rename columns in a `pandas DataFrame`

In [None]:
## Rename specific columns in the skyscrapers DataFrame:
## location.city is renamed to city,
## statistics.height is renamed to height,
## statistics.floors above is renamed to floors,
## status.completed.year is renamed to year_completed,
## status.started.year is renamed to year_started.
skyscrapers.rename(columns={'location.city': 'city',
                            'statistics.height': 'height',
                            'statistics.floors above': 'floors',
                            'status.completed.year': 'year_completed', 
                            'status.started.year': 'year_started'},
                   
                   ## The inplace=True argument ensures the changes are applied directly 
                   ## to the original DataFrame.
                   inplace=True)

In [None]:
skyscrapers.info()
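
For comparison, here is a sketch of what `.rename` does when `inplace=True` is left out: it returns a new DataFrame and leaves the original unchanged. The DataFrame `raw` and its values are made up for this example.

```python
import pandas as pd

## A made-up DataFrame with dotted column names like those above
raw = pd.DataFrame({"location.city": ["Chicago", "Dubai"],
                    "statistics.height": [442.1, 828.0]})

## Without inplace=True, rename returns a new DataFrame
## and leaves the original untouched
renamed = raw.rename(columns={"location.city": "city",
                              "statistics.height": "height"})

print(raw.columns.tolist())
print(renamed.columns.tolist())
```

Assigning the result to a new name keeps both versions available, which can make it easier to rerun notebook cells in any order.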

### Accessing columns and values from a `pandas` `DataFrame` and `Series`

**Example 7.** Access the `name` column from the `skyscrapers` dataframe and return a `Series` type object.

In [None]:
## Returns the values from the name column in the 
## skyscrapers DataFrame as a Series
skyscrapers['name']

### Series

A `pandas` `Series` is a one-dimensional labeled array capable of holding data of any type (integer, string, float, Python objects, etc.).

**Source:** [Geeks for Geeks](https://www.geeksforgeeks.org/python-pandas-series/)

**Example 8.** Access a numerical column from the `skyscrapers` dataframe and return a `Series` type object.

In [None]:
## Returns the values from the floors column in the 
## skyscrapers DataFrame as a Series
skyscrapers.floors

Since a `Series` is essentially a 1-dimensional array similar to a `NumPy` `ndarray` with labels, we can pass a `pandas` `Series` as an argument to many `NumPy` functions.

Let's import `NumPy` and see!

In [None]:
import numpy as np

In [None]:
skyscrapers["floors"]

**Example 9.** What are the average and median number of floors for all skyscrapers in the dataset?

In [None]:
## Calculate the mean (average) number of floors in the 
## skyscrapers dataset
mean_floors = np.mean(skyscrapers['floors'])

## Calculate the median number of floors in the skyscrapers dataset
med_floors = np.median(skyscrapers['floors'])

## Print the rounded mean and median number of floors
print("The average number of floors is", round(mean_floors))
print("The median number of floors is", round(med_floors))
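
The same summaries can also be computed with the `Series` object's own methods. The sketch below compares the two approaches on a small made-up `Series` (the values are illustrative, not taken from the dataset, and contain no missing entries).

```python
import numpy as np
import pandas as pd

## A small, made-up Series of floor counts
floors = pd.Series([40, 55, 62, 71, 88], name="floors")

## NumPy functions accept a Series as an argument
np_mean = np.mean(floors)
np_median = np.median(floors)

## The Series methods compute the same quantities
pd_mean = floors.mean()
pd_median = floors.median()

print(np_mean == pd_mean, np_median == pd_median)
```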

In [None]:
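## Sort the skyscrapers by number of floors (ascending) and
## select the row at integer position 50 in the sorted order
skyscrapers.sort_values(by='floors').iloc[50]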

### `Series` Attributes and Methods

**Attribute**
- An attribute of a `Series` is a property or characteristic that provides information about the `Series` itself.

- Attributes are accessed without parentheses, simply by referencing the attribute name.

- They provide metadata, statistics, or information about the Series but do not perform operations or transformations on the data within the `Series`.

- Examples of Series attributes include `dtype` (data type of the `Series`), `name` (name of the Series), `index` (index labels), and `shape` (shape of the Series).

- Accessing an attribute doesn't require invoking it as a function/method; you access it directly.

**Method**
- A method of a `Series` is a **function** that performs an operation or computation on the data within the `Series`.

- Methods are accessed with parentheses and often accept arguments or parameters to control their behavior.

- Methods manipulate or transform the data and return a result based on the operation performed.

- Examples of `Series` methods include `.sum()` (calculates the sum of elements), `.mean()` (calculates the mean), `.unique()` (returns unique values), and `.apply()` (applies a custom function to each element).

- Accessing a method requires invoking it as a function with parentheses.
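
The distinction between attributes and methods can be sketched on a small made-up `Series` (the values below are illustrative only):

```python
import pandas as pd

## A small, made-up Series
floors = pd.Series([40, 55, 55, 88], name="floors")

## Attributes: accessed without parentheses
print(floors.dtype)
print(floors.name)
print(floors.shape)

## Methods: called with parentheses
total = floors.sum()
distinct = floors.unique()
print(total, distinct)

## Without parentheses, a method name refers to the method itself
## rather than to its result
print(callable(floors.sum))
```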

**Example 10.** What is the index of the series?

In [None]:
## Access the index (row labels) of the floors column in the 
## skyscrapers DataFrame.
## This returns the row labels, not the values in the floors column.
skyscrapers['floors'].index


**Example 11.** What is the average number of floors for all skyscrapers in the dataset?

In [None]:
## Calculate the mean (average) of the values in the floors 
## column of the skyscrapers DataFrame.
skyscrapers.floors.mean()

## Data Moves

### Summarize

We've already explored how to find numerical summaries using the `.describe` method, as well as functions from the `NumPy` library and `pandas` `Series` methods. 

In [None]:
## Generate descriptive statistics for the floors column in the skyscrapers DataFrame.
## This will include the count, mean, standard deviation, minimum, quartiles, and maximum
## for the floors column.
skyscrapers['floors'].describe()

**Example 12.** What are the unique countries where the skyscrapers in the dataset are located?

In [None]:
## Return the unique values from the country column of the skyscrapers DataFrame.
## This will return an array of unique country names, showing each country only once.
skyscrapers.country.unique()

The `value_counts` method in pandas is particularly useful for counting the unique values in a `Series` or `DataFrame` column. It returns a `Series` where the unique values serve as the index, and the corresponding counts represent the frequency of each value, making it ideal for identifying the distribution of categorical data.

**Example 13.** How many skyscrapers are there in each country?

In [None]:
## Count the occurrences of each unique value in the country column of the 
## skyscrapers DataFrame.
## This will return a Series where the index is the unique country names, 
## and the values are the counts of how many times each country appears 
## in the country column, sorted in descending order by default.
skyscrapers.country.value_counts()

This code

```
skyscrapers.country.value_counts()
```

executed without any errors, but the output is incorrect because the same country has been entered in different ways, leading to inconsistent results. For instance:

- USA, US, and United States of America are listed separately, though they refer to the same country.

- United Arab Emirates appears multiple times as United Arab Emirates (UAE) and UAE.

- Malaysia is misspelled as Malasya.

- Saudi Arabia is listed as saudi Arabia (with a lowercase "s").
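
One possible way to reconcile such variants before counting is to map each one to a single spelling with `.replace`. The sketch below is an assumption about how the cleanup could be done, not the lecture's method. It uses a made-up `Series` containing the variants named above, and the `fixes` dictionary is hypothetical.

```python
import pandas as pd

## A made-up Series containing the inconsistent spellings listed above
country = pd.Series(["USA", "US", "United States of America",
                     "UAE", "United Arab Emirates (UAE)",
                     "Malasya", "Malaysia", "saudi Arabia"])

## Hypothetical mapping from each variant to one chosen spelling
fixes = {
    "USA": "United States of America",
    "US": "United States of America",
    "UAE": "United Arab Emirates",
    "United Arab Emirates (UAE)": "United Arab Emirates",
    "Malasya": "Malaysia",
    "saudi Arabia": "Saudi Arabia",
}

## Replace exact matches, then count the cleaned values
cleaned = country.replace(fixes)
counts = cleaned.value_counts()
print(counts)
```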

### Filter

Previously in the course we used a Boolean mask to filter an R dataframe. We can do the same thing with a pandas dataframe. Read the comments, then run each code cell.

In [None]:
## Create a boolean mask that checks if the value in the country column 
## is equal to "United States of America". This returns a Series of True/False
## values, where True indicates a match with the specified country.
skyscrapers['country'] == "United States of America"

In [None]:
## Create a boolean mask that checks if the values in the 'country' column 
## are equal to "United States of America". This mask will be a Series of 
## True/False values, where True indicates rows that match the specified country.
mask = skyscrapers['country'] == "United States of America"

In [None]:
## Filter the skyscrapers DataFrame using the boolean mask.
## This will return a new DataFrame containing only the rows where the mask is True,
## filtering the DataFrame based on the condition defined in the mask.
skyscrapers[mask]

Keep in mind that our dataset contains some data entry inconsistencies, where USA, US, and United States of America are listed separately, even though they all refer to the same country. To ensure we capture all skyscrapers in the USA, we need to use the logical **`OR`** operator (`|`) to filter for all variations of the country name.

**Note:** If we need to meet multiple conditions, we use the logical **`AND`** operator (`&`). For example, to filter and select skyscrapers located in Chicago, United States of America, we would use the following syntax:

```python
(skyscrapers['country'] == "United States of America") & (skyscrapers['city'] == "Chicago")
```
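
To see the **`AND`** filter in action end to end, here is a minimal sketch on a made-up DataFrame named `demo`. The rows and the name `chicago_mask` are assumptions for this example only.

```python
import pandas as pd

## A made-up DataFrame used only for illustration
demo = pd.DataFrame({
    "name": ["Tower A", "Tower B", "Tower C", "Tower D"],
    "country": ["United States of America", "United States of America",
                "China", "United States of America"],
    "city": ["Chicago", "New York City", "Shanghai", "Chicago"],
})

## Each comparison is wrapped in parentheses because & binds
## more tightly than ==
chicago_mask = (
    (demo["country"] == "United States of America") &
    (demo["city"] == "Chicago")
)

## Keep only the rows where both conditions are True
chicago = demo[chicago_mask]
print(chicago)
```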

In [None]:
## Create a boolean mask that checks if the value in the 'country' column is 
## either "United States of America" or "USA". The '|' operator is used to 
## apply the logical OR, meaning the mask will return True if either condition is met.
(skyscrapers['country'] == "United States of America") | (skyscrapers['country'] == "USA")

In [None]:
## Create a boolean mask to filter skyscrapers for all variations of the country name.
## The mask checks if the country column is equal to "United States of America", 
## "USA", or "United Sates of America". 

## The logical OR operator (|) ensures that any of these conditions will return True, 
## capturing all relevant rows.
mask = (
    (skyscrapers['country'] == "United States of America") | 
    (skyscrapers['country'] == "USA") |
    (skyscrapers['country'] == "United Sates of America")
)

When using logical operations or expressions over multiple lines, parentheses are often used to group expressions into a block, like this:

```python
mask = (  ## Outer parenthesis to start the block
    (skyscrapers['country'] == "United States of America") | 
    (skyscrapers['country'] == "USA") |
    (skyscrapers['country'] == "United Sates of America")
)  ## Outer parenthesis to end the block
```

The outer parentheses define the boundaries of the block. This allows Python to interpret the entire multi-line expression as a single logical unit, even though it is split across multiple lines. Without them, the first line would end with a dangling `|` operator, and Python would raise a `SyntaxError`.

In [None]:
## Filter the skyscrapers DataFrame using the boolean mask.
## This will return a new DataFrame containing only the rows where the mask is True,
## filtering the DataFrame based on the condition defined in the mask.
skyscrapers[mask]
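
When every condition compares the same column against a different value, an alternative to chaining `|` is the `Series` method `.isin`, which checks membership in a list. This is a different technique from the one used above. The sketch assumes a made-up DataFrame `demo` and a hypothetical list `us_names`.

```python
import pandas as pd

## A made-up DataFrame used only for illustration
demo = pd.DataFrame({
    "name": ["Tower A", "Tower B", "Tower C", "Tower D"],
    "country": ["United States of America", "USA",
                "United Sates of America", "China"],
})

## Hypothetical list of the spellings to match
us_names = ["United States of America", "USA", "United Sates of America"]

## True where the country is any of the listed spellings
isin_mask = demo["country"].isin(us_names)

## The equivalent mask built with the logical OR operator
or_mask = (
    (demo["country"] == "United States of America") |
    (demo["country"] == "USA") |
    (demo["country"] == "United Sates of America")
)

print(isin_mask.equals(or_mask))
print(demo[isin_mask])
```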

You can also filter a `DataFrame` using the `.query` method. The `.query` method allows you to specify conditions as a string, making the syntax simpler and avoiding the need for complex boolean operators like `&` (and) and `|` (or). This method is especially useful when filtering based on multiple conditions.

In [None]:
## Define a query string to filter the skyscrapers DataFrame for rows where the 'country' 
## column is "United States of America", "USA", or "United Sates of America".
q = 'country == "United States of America" or country == "USA" or country == "United Sates of America"'

## Use the .query() method to apply the query string 'q' to the skyscrapers DataFrame.
## This will return a DataFrame with only the rows that match the condition specified in 
## the query string.
skyscrapers.query(q)

In [None]:
## Define a query string that checks if the country column is equal to 
## "United States of America", "USA", or "United Sates of America".
## The string is split into multiple lines for readability and is 
## treated as one complete query.
q = 'country == "United States of America" or ' + \
    'country == "USA" or ' + \
    'country == "United Sates of America"'

q

In [None]:
skyscrapers.query(q)
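
A query string can also refer to a Python variable by prefixing its name with `@`, which may shorten long chains of `or`. The sketch below assumes a made-up DataFrame `demo` and a hypothetical list `us_names`. It is one possible variation, not the approach used in the cells above.

```python
import pandas as pd

## A made-up DataFrame used only for illustration
demo = pd.DataFrame({
    "name": ["Tower A", "Tower B", "Tower C", "Tower D"],
    "country": ["United States of America", "USA",
                "United Sates of America", "China"],
})

## Hypothetical list of the spellings to match
us_names = ["United States of America", "USA", "United Sates of America"]

## @us_names refers to the Python variable us_names defined above
result = demo.query("country in @us_names")
print(result)
```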