## Week 8 Lecture `.ipynb` File

#### Author: Mahmoud Harding

## Python Basics

### Built-in Functions

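- A built-in function ships with the programming language or application, so end users can call it without defining or importing it.

- It usually returns a value based on its arguments (`print` is an exception: it displays its arguments and returns `None`).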

- `print`, `abs`, `max`, `min`, `pow`, `round`, etc.
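
As a quick illustration of the remaining functions in the list, here is a minimal sketch. The variable names are made up for this example.

```python
# Built-in functions need no import; each call returns a value
smallest = min(3, 10 ** 2, 100.1)   # smallest of the three arguments
power = pow(2, 5)                   # 2 raised to the 5th power, same as 2 ** 5
rounded = round(3.14159, 2)         # round to 2 decimal places

print(smallest, power, rounded)
```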

In [None]:
abs(-3)

In [None]:
abs(2-5)

In [None]:
print(3, 10 ** 2, 100.1)

In [None]:
max(3, 10 ** 2, 100.1)

### Nesting Functions

In [None]:
round(abs(1.6002-1.688), 4)

In [None]:
1.6002-1.688

In [None]:
abs(1.6002-1.688)

In [None]:
round(abs(1.6002-1.688), 4)
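
Python evaluates a nested call from the inside out. The sketch below unrolls the nested expression into named steps; the variable names are illustrative only.

```python
difference = 1.6002 - 1.688      # the innermost expression is evaluated first
magnitude = abs(difference)      # abs() receives the result of the subtraction
result = round(magnitude, 4)     # round() receives the result of abs()

# The nested call gives the same value as the step-by-step version
print(result == round(abs(1.6002 - 1.688), 4))
```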

## Pandas

Pandas is an open source Python package widely used for data science, data analysis, and machine learning tasks. It is built on top of another library, `NumPy`, which provides support for arrays. Since we know how to perform operations on `NumPy` arrays, we can operate on columns in a `pandas` `DataFrame` in much the same way.

Pandas is a fast, powerful, flexible, and (sometimes) easy-to-use open source data analysis and manipulation tool. For a quick reference, see the Data Wrangling with `pandas` [Cheat Sheet](https://pandas.pydata.org/Pandas_Cheat_Sheet.pdf).

In [None]:
## Import the pandas library with the alias pd
...

In [None]:
## Read the CSV file skyscrapers.csv from the data directory 
## and store the data in a DataFrame named ss (i.e., skyscrapers)
ss = ...
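
For reference, here is a minimal sketch of the import-and-read pattern. It uses made-up CSV text held in memory rather than the course file, so the data, the column names, and the name `demo` are all assumptions.

```python
import io
import pandas as pd

# Made-up CSV text standing in for a file on disk (not the course data)
csv_text = """name,city,floors
Tower A,Springfield,50
Tower B,Shelbyville,42
"""

# pd.read_csv accepts a file path or a file-like object
demo = pd.read_csv(io.StringIO(csv_text))

# For a file on disk the call has the same shape (path assumed here):
# demo = pd.read_csv("data/skyscrapers.csv")
print(demo.head())
```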

**Example 1.** Display the first 5 rows of the `ss` `DataFrame`.

In [None]:
ss.head()

### Common `pandas` `DataFrame` Methods

- `.head()`
- `.info()`
- `.describe()`
- `.sample()`

Apply these methods to the `ss` (skyscrapers) `DataFrame` in the examples below.

**Example 2.** `.info()`

In [None]:
## Display a concise summary of the ss DataFrame, 
## including the index, column names, non-null counts, 
## and data types for each column
ss.info()

**Example 3.** `.describe()`

In [None]:
## Generate descriptive statistics for the numerical columns in 
## the ss DataFrame, including count, mean, standard deviation, 
## minimum, quartiles, and maximum values
ss.describe()

**Example 4.** `.sample()`

In [None]:
## Return a random sample of rows from the ss DataFrame.
## By default, it returns one row, but the number of rows can 
## be specified as an argument.
ss.sample()

In [None]:
## Return a random sample of 5 rows from the ss DataFrame
## by passing the number of rows as an argument.
ss.sample(5)
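
Because `.sample()` draws rows at random, the output changes each time the cell runs. One way to make a draw repeatable is the `random_state` parameter, sketched here on a small made-up `DataFrame` (not the skyscrapers data).

```python
import pandas as pd

# Hypothetical data for illustration only
demo = pd.DataFrame({"name": ["A", "B", "C", "D", "E", "F"],
                     "floors": [10, 20, 30, 40, 50, 60]})

# The same random_state gives the same rows on every run
first_draw = demo.sample(n = 3, random_state = 42)
second_draw = demo.sample(n = 3, random_state = 42)

print(first_draw.equals(second_draw))
```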

### Accessing `pandas` `DataFrame` Attributes

- `.shape`
- `.columns`

**Example 5.** `.shape`

In [None]:
## Return the dimensions of the ss DataFrame as 
## a tuple (number of rows, number of columns)
ss.shape

**Example 6.** `.columns`

In [None]:
## Return the column names of the ss DataFrame (as an Index object)
ss.columns
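
Both attributes return ordinary Python-friendly objects that can be unpacked or converted. A minimal sketch on a made-up `DataFrame` (the name `demo` and its columns are assumptions):

```python
import pandas as pd

# Hypothetical data for illustration only
demo = pd.DataFrame({"name": ["A", "B", "C"],
                     "floors": [10, 20, 30]})

# .shape is a tuple, so it can be unpacked into two variables
n_rows, n_cols = demo.shape

# .columns is an Index object; list() converts it to a plain list
column_names = list(demo.columns)

print(n_rows, n_cols, column_names)
```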

### Rename columns in a `pandas` `DataFrame`

To rename specific columns in the `ss` `DataFrame`, use the `.rename` method with the `columns` parameter to change `status.started` to `year_started`, `status.completed` to `year_completed`, and `height in meters` to `height_meters`.


```python
columns = {'height in meters': 'height_meters',
           'status.completed': 'year_completed',
           'status.started': 'year_started'}
```

Additionally, set `inplace = True` to apply the changes directly to the original `DataFrame`.

```python
inplace = True
```

In [None]:
ss.rename(columns = {'height in meters': 'height_meters',
                     'status.completed': 'year_completed',
                     'status.started': 'year_started'},
          inplace = True)

In [None]:
ss.info()
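
Without `inplace = True`, `.rename` leaves the original untouched and returns a new `DataFrame`, which can be assigned to a variable instead. A minimal sketch on made-up data (the names `demo` and `renamed` are assumptions):

```python
import pandas as pd

# Hypothetical data for illustration only
demo = pd.DataFrame({"height in meters": ["828", "632"],
                     "status.completed": [2010, 2015]})

# No inplace argument: the result is a new DataFrame
renamed = demo.rename(columns = {"height in meters": "height_meters",
                                 "status.completed": "year_completed"})

print(list(demo.columns))      # original column names are unchanged
print(list(renamed.columns))   # new column names
```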

### Drop columns in a `pandas` `DataFrame`

To drop (i.e., remove) the `Unnamed: 0` column from the `ss` `DataFrame`, use the `.drop` method with the `columns` parameter. 

```python
columns = ['Unnamed: 0']
```

Setting `inplace = True` ensures that the column is removed directly from the original `DataFrame` without creating a new one.

```python
inplace = True
```

In [None]:
ss.drop(columns = ['Unnamed: 0'], inplace = True)
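
A practical note for notebooks: running a drop cell a second time raises a `KeyError`, because the column no longer exists. One option is the `errors = "ignore"` argument of `.drop`, sketched here on a made-up `DataFrame` (not the course data).

```python
import pandas as pd

# Hypothetical data for illustration only
demo = pd.DataFrame({"Unnamed: 0": [0, 1],
                     "name": ["A", "B"]})

# First call removes the column
demo.drop(columns = ["Unnamed: 0"], inplace = True)

# Second call: the column is already gone, so errors = "ignore"
# skips the missing label instead of raising a KeyError
demo.drop(columns = ["Unnamed: 0"], inplace = True, errors = "ignore")

print(list(demo.columns))
```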

### Accessing columns and values from a `pandas` `DataFrame` and `Series`

**Example 7.** Access the `name` column from the `ss` `DataFrame` and return a `Series` type object.

In [None]:
## Returns the values from the name column in the 
## ss DataFrame as a Series
ss['name']

In [None]:
## Dot notation returns the same Series as ss['name']
ss.name
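
Bracket and dot notation return the same `Series`, but dot notation only works when the column name is a valid Python identifier that does not clash with an existing `DataFrame` attribute or method. A minimal sketch on made-up data (the column `count` is a hypothetical example of a clash):

```python
import pandas as pd

# Hypothetical data for illustration only
demo = pd.DataFrame({"name": ["A", "B"],
                     "height in meters": ["828", "632"],
                     "count": [1, 2]})

# Both notations give the same Series for a simple column name
same = demo["name"].equals(demo.name)

# A name with spaces has no dot-notation equivalent
heights = demo["height in meters"]

# demo.count is the DataFrame method, not the column,
# so the column needs bracket notation
counts = demo["count"]
is_method = callable(demo.count)

print(same, is_method)
```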

### Series

A `pandas` `Series` is a one-dimensional labeled array capable of holding data of any type (integer, string, float, Python objects, etc.).

**Source:** [Geeks for Geeks](https://www.geeksforgeeks.org/python-pandas-series/)

**Example 8.** Access the `country` column from the `ss` `DataFrame` and return a `Series` type object.

In [None]:
ss['country']

### `Series` Attributes and Methods

**Attributes**

Attributes provide metadata or properties about a Series without modifying its data.
They are accessed without parentheses. For example,

- `.index`: Returns the index labels.
- `.shape`: Returns a tuple representing dimensions.
- `.size`: Total number of elements.

**Methods**

Methods are functions that perform operations or transformations on the data.
They are accessed with parentheses and may take arguments. For example, 

- `.sum()`: Computes the sum of values.
- `.mean()`: Computes the average.
- `.unique()`: Returns unique values.
- `.apply()`: Applies a custom function to each element.
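
The attributes and methods listed above can be tried on a small made-up `Series`; the values and the name `floors` are assumptions for illustration.

```python
import pandas as pd

# Hypothetical data for illustration only
floors = pd.Series([10, 20, 20, 50], name = "floors")

# Attributes: no parentheses
labels = list(floors.index)   # index labels 0 through 3
dims = floors.shape           # one-element tuple
n_values = floors.size        # total number of elements

# Methods: parentheses, optionally with arguments
total = floors.sum()
average = floors.mean()
distinct = list(floors.unique())
doubled = floors.apply(lambda x: x * 2)   # custom function applied per element

print(dims, n_values, total, average, distinct)
```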

### Summarize

**Example 9.** Summarize the frequency of countries in the dataset using the `.value_counts()` method, which counts the occurrences of each unique country and returns a `Series` of counts sorted in descending order, with the most frequent values appearing first.

In [None]:
## Count how often each value occurs in the country column of the 
## ss DataFrame; the counts are returned as a Series in descending order
ss['country'].value_counts()

This code

```python
ss.country.value_counts()
```

executes without errors, but the counts are misleading because the same country has been entered in different ways, so its total is split across several labels. For instance:

- USA, US, and United States of America are listed separately, though they refer to the same country.

- United Arab Emirates also appears as United Arab Emirates (UAE) and as UAE.

- Malaysia is misspelled as Malasya.

- Saudi Arabia is listed as saudi Arabia (with a lowercase "s").
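
One possible way to reconcile such labels is to map each variant to a single spelling with `.replace()` before counting. The sketch below uses a small made-up `Series`, and the choice of canonical names (for example `USA`) is an assumption, not something the dataset dictates.

```python
import pandas as pd

# Hypothetical values that mimic the inconsistencies listed above
country = pd.Series(["USA", "US", "United States of America",
                     "UAE", "United Arab Emirates (UAE)",
                     "Malasya", "Malaysia", "saudi Arabia"])

# Map each variant to one chosen spelling (the choices are assumptions)
fixes = {"US": "USA",
         "United States of America": "USA",
         "UAE": "United Arab Emirates",
         "United Arab Emirates (UAE)": "United Arab Emirates",
         "Malasya": "Malaysia",
         "saudi Arabia": "Saudi Arabia"}

# .replace() swaps whole values that match a dictionary key
cleaned = country.replace(fixes)
counts = cleaned.value_counts()

print(counts)
```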

**Example 10.** Access the `height_meters` column from the `ss` `DataFrame` and return a `Series` type object.

In [None]:
## Returns the values from the height_meters column in the 
## ss DataFrame as a Series
ss.height_meters

We want summary statistics for the building heights, but first we need to clean the data. We expected the `height_meters` column (renamed from `height in meters`) to be numeric, but the `.info()` output shows it is stored with the `object` dtype, which here means the values are strings.

In [None]:
ss.info()

#### String Methods

In Python, string methods such as `.lower()`, `.upper()`, `.strip()`, and `.replace()` allow for text manipulation, including case conversion, whitespace removal, and substring replacement. 

In `pandas`, the `.str` accessor provides vectorized string operations for entire columns in a `DataFrame`. Methods like `.str.lower()`, `.str.contains()`, and `.str.replace()` handle text transformations, filtering, and cleaning.
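
To make the parallel concrete, the sketch below applies string methods to a single Python string and then the matching `.str` methods to a whole `Series`. The values are made up for illustration.

```python
import pandas as pd

# Plain Python: methods act on one string at a time
raw = "  Saudi Arabia "
tidy_one = raw.strip().lower()

# pandas: the .str accessor applies the method to every element
names = pd.Series(["  USA", "uae ", "Malaysia"])
tidy_all = names.str.strip().str.upper()

# .str.contains() returns a boolean Series (case-sensitive by default)
has_a = names.str.contains("a")

print(tidy_one)
print(list(tidy_all))
print(list(has_a))
```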

In [None]:
...

##### Step-by-Step Breakdown:

- Select the column `height_meters` from the `DataFrame` `ss`.

```python
ss['height_meters'] 
``` 

- Use `pandas` `.str.replace()` method to remove commas from all values in the column.
  
  - `","` Matches the comma
  
  - `""` Replaces commas with an empty string (removes them).

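  - `regex = True` Treats the pattern as a regular expression. For a plain comma the result is the same with `regex = False`; passing the argument explicitly avoids relying on the default, which has changed between pandas versions.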

```python
.str.replace(",", "", regex = True)
```

- Convert the cleaned column (now free of commas) from a string type (object) to float for numerical operations.

```python
.astype(float)
```
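
The steps above can be traced on a tiny made-up `Series` to see each intermediate result. The values and variable names are assumptions for illustration.

```python
import pandas as pd

# Hypothetical height strings, one with a thousands separator
heights = pd.Series(["1,000", "828", "632.5"])

# Step 1: remove commas from every value (still strings afterwards)
no_commas = heights.str.replace(",", "", regex = True)

# Step 2: convert the cleaned strings to floats
as_float = no_commas.astype(float)

print(list(no_commas))
print(list(as_float))
```

Calling `.astype(float)` directly on `heights` would fail on `"1,000"`, which is why the commas are removed first.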

**Example 11.** Add the cleaned and converted heights to the `ss` `DataFrame` as a new column named `clean_height_meters`.

In [None]:
ss['clean_height_meters'] = ss['height_meters'].str.replace(",", "", regex = True).astype(float)

In [None]:
ss.info()

In [None]:
ss['clean_height_meters'].describe()

**Example 12.** What is the average and median height (meters) for all skyscrapers in the dataset?

In [None]:
## The mean (average) height (meters) of the buildings
mean_height_meters = ss['clean_height_meters'].mean()

## The median height (meters) of the buildings
med_height_meters = ss['clean_height_meters'].median()

# Print the rounded mean and median
print("The average height in meters:", round(mean_height_meters, 1))
print("The median height in meters:", round(med_height_meters, 1))

Since a `Series` is essentially a 1-dimensional array similar to a `NumPy` `ndarray` with labels, we can pass a `pandas` `Series` as an argument to many `NumPy` functions.

Let's import `NumPy` and see!

In [None]:
...

**Example 13.** Use `NumPy` functions to recompute the average and median height (meters) for all skyscrapers in the dataset.

In [None]:
## The mean (average) height (meters) of the buildings
mean_height_meters = np.mean(ss['clean_height_meters'])

## The median height (meters) of the buildings
med_height_meters = np.median(ss['clean_height_meters'])

# Print the rounded mean and median
print("The average height in meters:", round(mean_height_meters, 1))
print("The median height in meters:", round(med_height_meters, 1))
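
As a quick check that the `NumPy` functions and the `Series` methods agree, here is a minimal sketch on a made-up `Series` with no missing values. The heights are assumptions, not values from the dataset.

```python
import numpy as np
import pandas as pd

# Hypothetical heights in meters, with no missing values
heights = pd.Series([828.0, 632.0, 601.0, 599.0])

# NumPy functions accept the Series directly
np_mean = np.mean(heights)
np_median = np.median(heights)

# Compare with the equivalent Series methods
print(np_mean == heights.mean())
print(np_median == heights.median())
```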

#### Sort

The code 
```python
ss.sort_values(by='clean_height_meters', ascending=False)
```
sorts the `DataFrame` `ss` in descending order by `clean_height_meters` placing the tallest buildings first.

In [None]:
ss.sort_values(by = 'clean_height_meters', ascending = False)
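
Note that `.sort_values()` returns a new, sorted `DataFrame` and leaves the original order unchanged unless the result is assigned. A minimal sketch on made-up data, chaining `.head()` to keep only the tallest rows (the name `demo` and its values are assumptions):

```python
import pandas as pd

# Hypothetical data for illustration only
demo = pd.DataFrame({"name": ["A", "B", "C"],
                     "clean_height_meters": [601.0, 828.0, 632.0]})

# Sort tallest first; the result is a new DataFrame
tallest_first = demo.sort_values(by = "clean_height_meters", ascending = False)

# Keep only the two tallest buildings
top_two = tallest_first.head(2)

print(list(demo["name"]))      # original order is unchanged
print(list(top_two["name"]))
```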