# Week 7 Lecture Notebook

## Introduction to Data Moves Using Python

Today we will …

- navigate the Jupyter Notebook interface.
- import modules.
- explore functionality from Python modules (e.g., math, NumPy, Pandas).
- apply data moves (e.g., Summarize) to Pandas dataframes.


## Importing

Python itself has a small built-in core. Extra functionality (like math, data analysis, working with files, web requests, etc.) is stored in modules and packages.

- Modules: Single Python files (e.g., `random.py` in the standard library)
- Packages: Collections of modules (e.g., pandas, numpy)

By importing, you reuse that code instead of writing everything from scratch.
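As a minimal sketch of what importing looks like, the cell below shows three common import forms. It uses the standard-library `statistics` module, which is chosen here only for illustration and is not otherwise part of this notebook.

```python
# Three common ways to import, illustrated with the standard-library
# statistics module (an illustrative choice, not part of this lesson's data).
import statistics                 # import the whole module
import statistics as st           # import the module under a shorter alias
from statistics import median     # import a single name from the module

values = [2, 4, 6, 8]

print(statistics.mean(values))    # access via the full module name
print(st.mean(values))            # access via the alias
print(median(values))             # use the imported name directly
```

The alias form is the one used later for NumPy and Pandas, where short conventional aliases keep code compact.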

### Math

In [None]:
pi

**Example 1.** Import the `math` module.

In [None]:
...

In [None]:
math.pi

In [None]:
math.cos(math.pi)

### NumPy

NumPy (short for Numerical Python) is a Python library used for numerical and scientific computing.

- Provides the ndarray object for efficient storage and manipulation of arrays/matrices
- Performs numerical operations on arrays much faster than equivalent loops over plain Python lists
- Supports advanced math operations (linear algebra, statistics, Fourier transforms, random numbers)
- Forms the foundation for many other packages like pandas, scikit-learn, and TensorFlow

#### Lists

Python lists are flexible: they can hold multiple items, mix different data types, and be nested to represent multiple dimensions. However, they lack vectorized arithmetic, which makes them inefficient for numerical computations.

In [None]:
lst1 = [1, 2, 3]
lst2 = [4, 5, 6]

In [None]:
lst1 + lst2

In [None]:
lst2 * 2
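
With lists, `+` concatenates and `*` repeats, so neither cell above performs element-wise arithmetic. A minimal sketch of the explicit looping that lists need instead (the names `a`, `b`, `elementwise_sum`, and `doubled` are illustrative):

```python
a = [1, 2, 3]
b = [4, 5, 6]

# Pair up items by position and add each pair
elementwise_sum = [x + y for x, y in zip(a, b)]
print(elementwise_sum)  # [5, 7, 9]

# Doubling every value also needs an explicit loop
doubled = [2 * y for y in b]
print(doubled)  # [8, 10, 12]
```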

**Example 2.** Import NumPy using the alias `np`. 

In [None]:
...

#### Vectorized Operations

Vectorized operations allow computations to be applied to entire arrays or matrices at once, rather than processing elements one by one.
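
To make the contrast concrete, here is a small sketch that converts temperatures from Celsius to Fahrenheit twice: once element by element with a list, and once as a single vectorized expression. The temperature values and variable names are made up for illustration.

```python
import numpy as np

# Hypothetical temperatures in degrees Celsius
temps_c_list = [0, 25, 100]

# Element by element, with a plain Python list
temps_f_list = [c * 9 / 5 + 32 for c in temps_c_list]

# Vectorized: one expression applied to the whole array at once
temps_c = np.array(temps_c_list)
temps_f = temps_c * 9 / 5 + 32

print(temps_f_list)
print(temps_f)
```

Both approaches give the same numbers; the vectorized version simply has no explicit loop in the Python code.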

**Example 3.** Create arrays `arr1` and `arr2` from the values in `lst1` and `lst2`.

In [None]:
## np.array() converts input data (like a list or tuple) into a NumPy array.
arr1 = np.array(...)
arr2 = np.array(...)

In [None]:
print(arr1)
print(arr2)

**Example 4.** Print the type for `arr1` and `arr2`.

In [None]:
print(type(arr1))
print(type(arr2))

**Example 5.** Add `arr1` and `arr2`.

In [None]:
arr1 + arr2

**Example 6.** Square the values in `arr1` and `arr2`.

In [None]:
...(arr1)

**Example 7.** Take the log (base $e$) of the values in `arr1` and `arr2`.

In [None]:
...(arr1)

## Pandas

Pandas is an open-source Python package widely used for data science, data analysis, and machine learning tasks. It is built on top of another library, `NumPy`, which provides support for arrays. Since we know how to perform operations on `NumPy` arrays, we can operate on the columns of a `pandas` dataframe in the same way.

Pandas is a fast, powerful, flexible, and (sometimes) easy-to-use open-source data analysis and manipulation tool. For a quick reference, see the Data Wrangling with `pandas` [Cheat Sheet](https://pandas.pydata.org/Pandas_Cheat_Sheet.pdf).

**Example 8.** Import Pandas with the alias `pd`.

In [None]:
...

In [None]:
## Read the CSV file skyscrapers.csv from the data directory
## and store the data in a DataFrame named ss
ss = pd.read_csv('...')

### Pandas Dataframe

A `pandas` `DataFrame` is a two-dimensional, labeled data structure in Python, similar to a table in Excel or an R dataframe. It organizes data into rows and columns, where rows represent observations (with an index label) and columns represent variables. Each column is a `pandas` `Series`, which has a name and holds data of the same type (such as numbers, text, or dates).
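
A minimal sketch of this structure, built from a dictionary rather than a file. The building names, heights, and the variable name `buildings` are hypothetical and unrelated to the skyscrapers data.

```python
import pandas as pd

# Hypothetical data: each dictionary key becomes a column
buildings = pd.DataFrame({
    'name': ['Tower A', 'Tower B', 'Tower C'],
    'height_m': [310.0, 452.5, 381.0],
})

print(buildings)                    # rows are labeled 0, 1, 2 by default
print(type(buildings['height_m']))  # a single column is a Series
print(buildings['height_m'].dtype)  # every value in the column shares one type
```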

### Attributes & Methods

#### Attributes

An attribute is a property of the object. You don’t call it with `()`; you just access it, because it usually _**describes something**_ about the object.

In [None]:
## .columns returns the column names as an Index object
ss.columns

In [None]:
## .shape returns the dimensions of a dataframe or series
ss.shape

#### Methods

A method is a function attached to an object. You call it with parentheses () because it usually _**does something**_ (like computation, printing, or transforming data).
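
The difference between the two can be sketched side by side on a small, made-up dataframe (the name `toy` and its values are illustrative only):

```python
import pandas as pd

# Hypothetical three-row dataframe
toy = pd.DataFrame({'city': ['X', 'Y', 'Z'], 'floors': [10, 20, 30]})

n_rows, n_cols = toy.shape     # attribute: no parentheses
first_two = toy.head(2)        # method: parentheses, with an optional argument

print(n_rows, n_cols)
print(first_two)

# Calling an attribute like a function fails, because a tuple is not callable
try:
    toy.shape()
except TypeError as err:
    print('TypeError:', err)
```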

In [None]:
## The .head() method displays the first rows (5 by default) of a Pandas dataframe
ss.head()

In [None]:
## The .info() method prints a summary of the dataframe structure
ss.info()

## Data Wrangling with Pandas Methods

Data wrangling with pandas methods involves cleaning, reshaping, and preparing data for analysis. Pandas provides powerful methods such as `.drop()` to remove rows or columns, `.rename()` to adjust labels, and `.astype()` to change data types. These methods make it easy to transform raw datasets into tidy, structured dataframes that are ready for exploration and analysis.
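
As a rough sketch of how these three methods fit together, the cell below cleans a tiny made-up dataframe. The column names and values are hypothetical and are not those of the skyscrapers file.

```python
import pandas as pd

# Hypothetical raw data; the column names are made up for this sketch
raw = pd.DataFrame({
    'Name': ['Tower A', 'Tower B'],
    'Notes': ['n/a', 'n/a'],
    'Height (m)': ['310', '452'],   # numbers stored as text
})

tidy = raw.drop(columns=['Notes'])                           # remove a column
tidy = tidy.rename(columns={'Height (m)': 'height_meters'})  # relabel a column
tidy['height_meters'] = tidy['height_meters'].astype(int)    # text -> integers

print(tidy.dtypes)
```

Each of these methods returns a new object instead of changing the original, which is why the result is assigned back to a variable at every step.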

In [None]:
## The .drop() method removes specified rows or columns from a dataframe
ss = ss.drop(columns = ...)

In [None]:
## The .rename() method changes the labels of rows or columns in a dataframe
ss = ss.rename(
    columns = {
        ... : 'status_started',
        ... : 'status_completed',
        ... : 'height_meters'
    }
)

## Verify that the column names have changed
ss.columns

**Example 9.** Access the `height_meters` column from the `ss` dataframe.

In [None]:
...

In [None]:
## The .to_list() method converts a Pandas Series into a Python list
print(ss['floors'].to_list())

**Example 10.** Filter the `ss` dataframe to find the rows where `floors` is either `'103 floors'` or `'73 (68 Above Ground and 5 Below Ground)'`.

In [None]:
ss['floors']

In [None]:
print(ss['floors'].to_list())

In [None]:
ss['floors'] == ...

In [None]:
ss['floors'] == ...

In [None]:
(ss['floors'] == ...) | (ss['floors'] == ...)

In [None]:
...

In [None]:
...

In [None]:
mask = ...
ss[mask]
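
The same filtering pattern, sketched on a small made-up dataframe so that each intermediate step can be inspected. The data and the names `towers`, `is_55`, and `is_70` are hypothetical; `.isin()` is shown as an alternative that this notebook does not otherwise use.

```python
import pandas as pd

# Hypothetical data
towers = pd.DataFrame({
    'name': ['A', 'B', 'C', 'D'],
    'floors': ['40', '55 floors', '62', '70 (approx.)'],
})

# Each comparison yields a Series of True/False values, one per row
is_55 = towers['floors'] == '55 floors'
is_70 = towers['floors'] == '70 (approx.)'

# | means "or"; parentheses are needed when the comparisons are written inline
mask = is_55 | is_70
print(towers[mask])

# .isin() expresses the same filter more compactly
same_rows = towers[towers['floors'].isin(['55 floors', '70 (approx.)'])]
```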

In [None]:
## The .loc[] attribute accesses rows and columns in a dataframe by their labels.
ss.loc[48]

In [None]:
## The .loc[] attribute accesses rows and columns in a dataframe by their labels.
ss.loc[61]

In [None]:
## .loc[row_label, column_label] can also assign a new value to a single cell.
ss.loc[48, 'floors'] = ...

In [None]:
## .loc[row_label, column_label] can also assign a new value to a single cell.
ss.loc[61, 'floors'] = ...

In [None]:
## Verify that the values have changed
ss.loc[[48, 61], 'floors']
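
A minimal sketch of why this kind of cell-level fix matters before converting a column to integers. The dataframe, its index labels, and its values are made up for illustration.

```python
import pandas as pd

# Hypothetical data with row labels 10, 11, 12 (labels, not positions)
towers = pd.DataFrame({'floors': ['40', '55 floors', '62']}, index=[10, 11, 12])

# .loc[row_label] returns the whole row as a Series
row = towers.loc[11]
print(row)

# The text '55 floors' cannot be parsed as an integer
try:
    towers['floors'].astype(int)
    conversion_failed = False
except ValueError:
    conversion_failed = True

# Overwrite the single problematic cell, then convert the column
towers.loc[11, 'floors'] = '55'
floors_as_int = towers['floors'].astype(int)
print(floors_as_int)
```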

**Example 11.** Convert the `floors` column of the `ss` dataframe into integers.

In [None]:
print(ss['floors'].to_list())

In [None]:
ss['floors'] = ss['floors'].astype(...)

## Verify the data type of the column has changed
ss.info()

## Data Moves

Data moves are the actions analysts take to transform a dataset, such as grouping or filtering data, creating new summary variables, or restructuring the dataset, in order to highlight or alter specific features of the data and enable different analytical techniques (Erickson et al., 2019).

### Summarizing Non-Numerical Data (Categorical & Logical)

Categorical and logical variables can also be summarized using frequency counts.

**Example 12.** Compute the frequency distribution of country names in the dataset.

In [None]:
## The .value_counts() method returns the frequency of unique values in a Series.
ss['country']

In [None]:
## The .value_counts(normalize=True) method returns the proportions of unique values in a Series.
ss['country']
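
For reference, a small sketch of both forms of `.value_counts()` on a made-up Series (the country labels and variable names are hypothetical, not taken from the skyscrapers data):

```python
import pandas as pd

# Hypothetical country labels
countries = pd.Series(['USA', 'China', 'USA', 'UAE'])

counts = countries.value_counts()                     # how many times each value occurs
proportions = countries.value_counts(normalize=True)  # share of the total

print(counts)
print(proportions)
```

The proportions always sum to 1, which makes them easier to compare across datasets of different sizes.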

### Summarizing Numerical Data

The summarizing data move involves condensing data by calculating key measures such as counts, averages, percentages, or other summary statistics. This reduction can reveal patterns and trends, making the information easier to interpret and compare.

In [None]:
## The .mean() method calculates the average value of a numeric Series.
ss.floors

In [None]:
## The .describe() method generates summary statistics for a Series or dataframe, 
## including count, mean, standard deviation, minimum, quartiles, and maximum.
ss.floors
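
A minimal sketch of both summaries on a made-up numeric Series (the values and names are hypothetical):

```python
import pandas as pd

# Hypothetical heights in meters
heights = pd.Series([100, 200, 300, 400])

average = heights.mean()      # a single number
summary = heights.describe()  # a Series of several statistics

print(average)
print(summary)

# Individual statistics can be looked up by label
print(summary['max'])
```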