# Outline<a class="anchor" id=outline></a>

[1. Working With Modules](#motivation)<br>
[2. Numpy](#numpy)<br>
[3. Intro to Pandas](#pandas)

# What are Modules? <a class="anchor" id=motivation></a>
("modules","packages","libraries")

<img src=https://theyoungteam.com/wp-content/uploads/2023/12/Blog-Post-Image-14.png>

[Good Resource List of Modules](https://ioflood.com/blog/python-libraries/)

## How do I know what modules I currently have?

>[Anaconda Modules List](https://docs.anaconda.com/free/anaconda/reference/packages/pkg-docs/)
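You can also ask Python itself which packages are installed. The sketch below uses the standard-library `importlib.metadata` module (Python 3.8+); it is one possible approach rather than part of the original notebook, and the variable name `installed` is only illustrative.

```python
from importlib import metadata

# Collect the name of every installed distribution visible to this interpreter
names = {dist.metadata.get("Name") for dist in metadata.distributions()}
installed = sorted((name for name in names if name), key=str.lower)

print(len(installed), "packages installed")
print(installed[:5])  # first few names, alphabetically

# Look up the version of one specific package
try:
    print("numpy version:", metadata.version("numpy"))
except metadata.PackageNotFoundError:
    print("numpy is not installed")
```

This reports what the running interpreter can see, which is useful when several Python environments exist on the same machine.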

[Go Up](#outline)

### How do I access modules I've already installed?

In [32]:
import numpy as np
import pandas as pd
import plotly as plt
import math
import time

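In [None]:
# Type the alias followed by a dot, then press Tab to browse what the module offers
np.

In [None]:
# The full name is not defined here because the module was imported as np
numpy.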

### How do I access functionality from this package?

In [29]:
array_example = np.array([1,2,3])

In [30]:
type(array_example)

numpy.ndarray

In [31]:
array_example

array([1, 2, 3])

### Importing Specific Functionality

In [33]:
from numpy import array as np_array

In [34]:
x = np_array([3,4,5])  # use the name imported above instead of np.array

In [35]:
x

array([3, 4, 5])
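The different import styles all end up pointing at the same objects. As a small sketch, here they are side by side using the standard-library `math` module (chosen only because it is always available; the aliases `m` and `root` are illustrative):

```python
import math                      # import the whole module
import math as m                 # same module under a shorter alias
from math import sqrt            # import one name directly
from math import sqrt as root    # import one name under an alias

print(math.sqrt(16))  # 4.0
print(m.sqrt(16))     # 4.0
print(sqrt(16))       # 4.0
print(root(16))       # 4.0

# All four names refer to the same function object
print(math.sqrt is m.sqrt is sqrt is root)  # True
```

Importing a specific name keeps call sites short, while importing the module (or an alias) makes it obvious where each function comes from.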

### What if I need to install a new Python package?

# PIP

In [36]:
pip install pip

Note: you may need to restart the kernel to use updated packages.


In [None]:
!pip install [package name]
#pip install requests

In [None]:
pip install --upgrade requests

In [None]:
pip install -r requirements.txt

In [None]:
pip list

In [None]:
pip uninstall [package name]

In [9]:
pip list --outdated

Package                       Version   Latest       Type
----------------------------- --------- ------------ -----
aiobotocore                   2.4.2     2.9.0        wheel
aiofiles                      22.1.0    23.2.1       wheel
aiohttp                       3.8.3     3.9.1        wheel
aioitertools                  0.7.1     0.11.0       wheel
aiosignal                     1.2.0     1.3.1        wheel
aiosqlite                     0.18.0    0.19.0       wheel
alabaster                     0.7.12    0.7.16       wheel
altair                        5.0.1     5.2.0        wheel
ansi2html                     1.8.0     1.9.1        wheel
anyio                         3.5.0     4.2.0        wheel
appnope                       0.1.2     0.1.3        wheel
appscript                     1.1.2     1.2.4        wheel
argon2-cffi                   21.3.0    23.1.0       wheel
arrow                         1.2.3     1.3.0        wheel
astroid                       2.14.2    3

Note: you may need to restart the kernel to use updated packages.


[Go Up](#outline)

# Numpy<a class="anchor" id=numpy></a>

Why do I need to know this?

>1. **Performance:** NumPy arrays are more memory-efficient than Python lists and perform better for numerical computation. NumPy is implemented in C, and every element of an array has the same fixed type, which reduces memory requirements and makes computation fast.

>2. **Pandas Foundation:** Pandas was built on top of NumPy

>3. **Statistical Capabilities**

>4. **Matrix Operations**
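To make the "fixed-type" point concrete, here is a minimal sketch that inspects an array's element type and size. The list-size estimate assumes the standard CPython interpreter, where a list stores references to separate integer objects; the variable names are illustrative.

```python
import sys
import numpy as np

values = list(range(1000))
arr = np.array(values, dtype=np.int64)

print(arr.dtype)      # int64
print(arr.itemsize)   # 8 bytes per element
print(arr.nbytes)     # 8000 bytes of raw data

# Rough list footprint: the list itself plus each separate int object
list_bytes = sys.getsizeof(values) + sum(sys.getsizeof(v) for v in values)
print(list_bytes, "bytes for the list vs", arr.nbytes, "for the array")

# Mixing ints and floats upcasts everything to one float type
mixed = np.array([1, 2, 3.5])
print(mixed.dtype)    # float64
```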

In [63]:
# Simulated dataset
data = [32.5, 29.8, 30.5, 230, 31.2, 
        33.8, 30.7, 35.6, 30.1, 200.0, 31.0]

# Calculate mean and standard deviation without numpy
mean_value = sum(data) / len(data)

# sum of squared diffs
squared_diff_sum = 0
for value in data:
    squared_diff_sum += (value - mean_value)**2
std_dev = (squared_diff_sum / len(data))**0.5


In [None]:
# Note: the for loop above can be replaced by the following
squared_diff_sum = sum((value - mean_value)**2 for value in data)

# This embeds the loop "in line".
# In Python this is called a generator expression.
# It does the same thing as the regular loop and, unlike a
# list comprehension, never builds an intermediate list in memory.

### How does Numpy make this easier?

In [None]:
import numpy as np

# Simulated dataset
data = [32.5, 29.8, 30.5, 230, 31.2, 33.8, 
        30.7, 35.6, 30.1, 200.0, 31.0]

mean = np.mean(data)  # arithmetic mean in one call
std = np.std(data)    # population standard deviation (divides by n, like the loop above)




In [67]:
import numpy as np

# Simulated dataset
data = [32.5, 29.8, 30.5, 230, 31.2, 33.8, 30.7, 35.6, 30.1, 200.0, 31.0]

mean_val = np.mean(data)
stdev = np.std(data)

# Define outlier threshold
outlier_threshold = 1.5 * stdev

# Identify and flag outliers
outliers = []

for val in data:
    if abs(val - mean_val) > outlier_threshold:
        outliers.append(val)

print("Outliers:", outliers)


Outliers: [230, 200.0]
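The same outlier check can be written without an explicit loop once the data is a NumPy array. This is a sketch of an equivalent approach using a boolean mask, with the same 1.5 standard deviation threshold; the name `is_outlier` is illustrative.

```python
import numpy as np

data = np.array([32.5, 29.8, 30.5, 230, 31.2, 33.8,
                 30.7, 35.6, 30.1, 200.0, 31.0])

outlier_threshold = 1.5 * data.std()

# Element-wise comparison produces one True/False per value
is_outlier = np.abs(data - data.mean()) > outlier_threshold

# Indexing with the mask keeps only the values marked True
outliers = data[is_outlier]
print("Outliers:", outliers)  # Outliers: [230. 200.]
```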


### Performance?

In [75]:
num_list = [250,300,400,250]

# num_list = [250,300,400,350,100,200,250,300,85,12,18,36,4312,
#             200,120,12,34,876,890,101234,83,10,12.3,24.8,
#             35.1,89.123,39.444,2011563,83,248.35,12897,42,
#             360,400,12,2.4,36,41,15,85430
#            ]

In [76]:
# Numpy Arrays
# Quarterly sales data (in thousands)
sales_data = np.array(num_list)
annual_sales = sales_data.sum()
print("Annual Sales:", annual_sales, "thousand dollars")

Annual Sales: 1200 thousand dollars


In [73]:
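import time

# Time the built-in sum on the Python list
start_time = time.time()
sum(num_list)
end_time = time.time()
run_time1 = end_time - start_time

# Time the NumPy sum on the array
start_time = time.time()
sales_data.sum()
end_time = time.time()
run_time2 = end_time - start_time

# Ratio of NumPy time to list time; above 1 means NumPy was slower on this tiny input
time_diff = (run_time2/run_time1)
print(time_diff)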

1.6533333333333333


>**Aside:** Double-click the markdown cell below to see how the well-formatted table was created. A good shortcut for creating such tables is to ask ChatGPT to write the markdown for you, then simply copy and paste!

| Data Size | Time for Python List (seconds) | Time for NumPy Array (seconds) |
|-----------|--------------------------------|--------------------------------|
| 10        | 1.67e-06                       | 0.000136                       |
| 100       | 1.43e-06                       | 1.91e-05                       |
| 1,000     | 6.44e-06                       | 2.62e-05                       |
| 10,000    | 7.51e-05                       | 5.13e-05                       |
| 100,000   | 0.000824                       | 0.000180                       |
| 1,000,000 | 0.00768                        | 0.00172                        |
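Timings like those in the table depend on the machine. A rough way to produce comparable measurements is the standard-library `timeit` module, which runs each statement several times. The helper name `compare_sum_times` and the sizes below are assumptions for illustration, and your numbers will differ.

```python
import timeit
import numpy as np

def compare_sum_times(size, repeats=20):
    """Return (list_seconds, array_seconds) per call for summing `size` numbers."""
    values = list(range(size))
    array = np.array(values)
    list_time = timeit.timeit(lambda: sum(values), number=repeats) / repeats
    array_time = timeit.timeit(lambda: array.sum(), number=repeats) / repeats
    return list_time, array_time

for size in (10, 1_000, 100_000):
    list_time, array_time = compare_sum_times(size)
    print(f"{size:>7,} | list: {list_time:.2e} s | numpy: {array_time:.2e} s")
```

On small inputs the fixed overhead of calling into NumPy tends to dominate, which is consistent with the first rows of the table; the advantage shows up as the data grows.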


## Broadcasting

In [77]:
sales_data

array([250, 300, 400, 250])

In [78]:
growth_rate = 1.10
updated_sales = sales_data * growth_rate
new_annual_sales = updated_sales.sum()
print("New Annual Sales:", new_annual_sales, "thousand dollars")

New Annual Sales: 1320.0 thousand dollars
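Broadcasting also applies between arrays of different shapes, not just an array and a single number. The sketch below uses made-up growth multipliers (hypothetical values, chosen only for illustration) to show one multiplier per quarter and one per region.

```python
import numpy as np

# Rows are regions, columns are quarters (values in thousands)
regional_sales = np.array([[250, 300, 400, 350],
                           [200, 250, 350, 300]])

# Hypothetical growth multipliers, one per quarter: shape (4,)
quarterly_growth = np.array([1.0, 1.5, 2.0, 0.5])

# Shapes (2, 4) and (4,) are compatible: the 1-D array is applied to every row
projected = regional_sales * quarterly_growth
print(projected.shape)  # (2, 4)
print(projected)

# One multiplier per region needs shape (2, 1) so it lines up with the rows
regional_growth = np.array([[2.0], [0.5]])
by_region = regional_sales * regional_growth
print(by_region)
```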


## Indexing & Slicing

In [79]:
sales_data

array([250, 300, 400, 250])

In [84]:
sales_data[:2].sum()

550

In [85]:
H1_sales = sales_data[:2].sum()
H2_sales = sales_data[2:].sum()
print("H1 Sales vs H2 Sales:", H1_sales, "vs", H2_sales, "thousand dollars")

H1 Sales vs H2 Sales: 550 vs 650 thousand dollars
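A few more slicing patterns are worth knowing, along with one behavior that often surprises newcomers: a basic slice of a NumPy array is a view onto the same data, not a copy. A minimal sketch (the names `first_half` and `safe_half` are illustrative):

```python
import numpy as np

sales_data = np.array([250, 300, 400, 250])

print(sales_data[-1])    # 250, the last quarter
print(sales_data[1:3])   # [300 400], Q2 and Q3
print(sales_data[::2])   # [250 400], every other quarter

# A slice is a view: writing to it changes the original array
first_half = sales_data[:2]
first_half[0] = 999
print(sales_data)        # [999 300 400 250]

# Use .copy() when you need an independent array
safe_half = sales_data[:2].copy()
safe_half[0] = 0
print(sales_data[0])     # 999, unchanged by the edit to the copy
```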


## Statistical Methods

In [86]:
monthly_sales = np.array([80, 90, 85, 88, 92, 85, 87, 90, 95, 85, 88, 90])
print("Average Monthly Sales:", monthly_sales.mean())
print("Median Monthly Sales:", np.median(monthly_sales))
print("Sales Variability (Std Dev):", monthly_sales.std())

# Question: why does the syntax look different for mean vs. median?
# Why np.median(array) but array.mean()?

Average Monthly Sales: 87.91666666666667
Median Monthly Sales: 88.0
Sales Variability (Std Dev): 3.751851394830143
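One way to explore the question above is to check which operations exist as array methods and which exist only as module-level functions. A small sketch:

```python
import numpy as np

monthly_sales = np.array([80, 90, 85, 88, 92, 85, 87, 90, 95, 85, 88, 90])

# mean is available both as a function and as an array method
print(np.mean(monthly_sales) == monthly_sales.mean())  # True

# median is available only as a function, not as a method
print(hasattr(monthly_sales, "mean"))    # True
print(hasattr(monthly_sales, "median"))  # False
print(np.median(monthly_sales))          # 88.0
```

When unsure, the function form (`np.mean`, `np.median`, `np.std`) is the safer habit because it also accepts plain Python lists.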


## Boolean Indexing and Masking
>**Concept**: How to use boolean conditions to filter arrays.<br>
**Example**: Identify months with sales exceeding a certain threshold.<br>
**Business Application**: Find high-performing months.

In [87]:
monthly_sales

array([80, 90, 85, 88, 92, 85, 87, 90, 95, 85, 88, 90])

In [89]:
high_sales_threshold = 90  # define the threshold before it is first used
monthly_sales > high_sales_threshold

array([False, False, False, False,  True, False, False, False,  True,
       False, False, False])

In [91]:
filter_condition = (monthly_sales > high_sales_threshold)

In [92]:
monthly_sales[filter_condition]

array([92, 95])

In [88]:
high_sales_threshold = 90
high_sales_months = monthly_sales[monthly_sales > high_sales_threshold]
print("Months with High Sales:", high_sales_months)

Months with High Sales: [92 95]
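Masks can be combined and counted as well. The sketch below, with illustrative thresholds, filters on two conditions at once and recovers which months matched.

```python
import numpy as np

monthly_sales = np.array([80, 90, 85, 88, 92, 85, 87, 90, 95, 85, 88, 90])

# Combine conditions with & (and) or | (or); each condition needs parentheses
mid_range = monthly_sales[(monthly_sales >= 85) & (monthly_sales <= 90)]
print(mid_range)    # [90 85 88 85 87 90 85 88 90]

# Recover which months passed the filter (1 = January)
high_months = np.flatnonzero(monthly_sales > 90) + 1
print(high_months)  # [5 9]

# Count matches by summing the mask (True counts as 1)
high_count = (monthly_sales > 90).sum()
print(high_count)   # 2
```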


## Multi-Dimensional Arrays
>**Concept**: Introduce multi-dimensional arrays and operations on them.<br>
**Example**: Represent and manipulate data in a matrix format, such as sales data across multiple regions.<br>
**Business Application**: Calculate regional sales totals.

In [94]:
# Sales data: rows represent regions, columns represent quarters
regional_sales = np.array([[250, 300, 400, 350],
                           [200, 250, 350, 300],
                           [300, 350, 450, 400]
                          ]
                         )
total_regional_sales = regional_sales.sum(axis=1)
print("Total Sales by Region:",
      total_regional_sales,
      "thousand dollars")

# Question: what does "axis = 1" do?


Total Sales by Region: [1300 1100 1500] thousand dollars


In [93]:
regional_sales.shape

(3, 4)
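To see what the `axis` argument does, it helps to compare both directions on the same array. A minimal sketch, along with a few two-dimensional indexing patterns:

```python
import numpy as np

regional_sales = np.array([[250, 300, 400, 350],
                           [200, 250, 350, 300],
                           [300, 350, 450, 400]])

print(regional_sales.sum(axis=1))  # [1300 1100 1500], one total per region (row)
print(regional_sales.sum(axis=0))  # [ 750  900 1200 1050], one total per quarter (column)
print(regional_sales.sum())        # 3900, grand total

# Index a single cell, a whole row, or a whole column
print(regional_sales[0, 2])   # 400, region 0 in Q3
print(regional_sales[1])      # [200 250 350 300]
print(regional_sales[:, 0])   # [250 200 300], Q1 for every region
```

A useful way to remember it: the axis you name is the one that gets collapsed.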


# Pandas!!!!!<a class="anchor" id=pandas></a>

A Python module worth knowing REALLY well
> There are two major data structures in Pandas
> 1. Pandas Series
> 2. Pandas DataFrame
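As a first look at the two structures, here is a small sketch built from made-up quarterly figures (the labels and column names are illustrative, not taken from the dataset used later).

```python
import pandas as pd

# A Series is a one-dimensional labeled array
quarterly_sales = pd.Series([250, 300, 400, 250], index=["Q1", "Q2", "Q3", "Q4"])
print(quarterly_sales["Q3"])   # 400
print(quarterly_sales.sum())   # 1200

# A DataFrame is a table whose columns are Series sharing one index
sales_df = pd.DataFrame({
    "north": [250, 300, 400, 350],
    "south": [200, 250, 350, 300],
}, index=["Q1", "Q2", "Q3", "Q4"])

print(isinstance(sales_df["north"], pd.Series))  # True
print(sales_df.shape)                            # (4, 2)
print(sales_df["north"].to_numpy())              # the column's values as a NumPy array
```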

[Go Up!](#outline)

In [6]:
import pandas as pd

In [7]:
ls

Intro_to_Modules_Numpy_Pandas.ipynb  data.csv


In [8]:
my_first_df = pd.read_csv('data.csv')
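When a CSV's first column has no header (commonly a row index saved by an earlier export), pandas names it `Unnamed: 0`. The sketch below uses a tiny in-memory stand-in for `data.csv`, with two rows and a subset of columns, to show how `index_col=0` treats that column as the index instead. Whether this applies to the real file is an assumption.

```python
import io
import pandas as pd

# Stand-in for data.csv: two rows and a subset of columns, for illustration only
csv_text = """,Make,Model,Year,MSRP
11894,BMW,Z4,2014,48950
11903,BMW,Z8,2001,128000
"""

# Default: the unnamed first column is loaded as a regular column
df_default = pd.read_csv(io.StringIO(csv_text))
print(df_default.columns.tolist())    # ['Unnamed: 0', 'Make', 'Model', 'Year', 'MSRP']

# index_col=0 uses that first column as the row index instead
df_indexed = pd.read_csv(io.StringIO(csv_text), index_col=0)
print(df_indexed.columns.tolist())    # ['Make', 'Model', 'Year', 'MSRP']
print(df_indexed.loc[11903, "MSRP"])  # 128000
```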

In [13]:
my_first_df.tail(20)

Unnamed: 0,Make,Model,Year,engine_fuel_type,horse_power,cylinders,transmission_type,drive_train,num_of_doors,market_category,vehicle_size,vehicle_style,highway_mpg,city_mpg,popularity_index,MSRP
11894,BMW,Z4,2014,premium unleaded (required),240.0,4.0,MANUAL,rear wheel drive,2.0,"Luxury,Performance",Compact,Convertible,34,22,3916,48950
11895,BMW,Z4,2014,premium unleaded (required),300.0,6.0,MANUAL,rear wheel drive,2.0,"Luxury,High-Performance",Compact,Convertible,26,19,3916,56950
11896,BMW,Z4,2014,premium unleaded (required),335.0,6.0,AUTOMATED_MANUAL,rear wheel drive,2.0,"Luxury,High-Performance",Compact,Convertible,24,17,3916,65800
11897,BMW,Z4,2015,premium unleaded (required),240.0,4.0,MANUAL,rear wheel drive,2.0,"Luxury,Performance",Compact,Convertible,34,22,3916,48950
11898,BMW,Z4,2015,premium unleaded (required),300.0,6.0,AUTOMATED_MANUAL,rear wheel drive,2.0,"Luxury,High-Performance",Compact,Convertible,24,17,3916,56950
11899,BMW,Z4,2015,premium unleaded (required),335.0,6.0,AUTOMATED_MANUAL,rear wheel drive,2.0,"Luxury,High-Performance",Compact,Convertible,24,17,3916,65800
11900,BMW,Z4,2016,premium unleaded (required),300.0,6.0,AUTOMATED_MANUAL,rear wheel drive,2.0,"Luxury,High-Performance",Compact,Convertible,24,17,3916,57500
11901,BMW,Z4,2016,premium unleaded (required),240.0,4.0,MANUAL,rear wheel drive,2.0,"Luxury,Performance",Compact,Convertible,34,22,3916,49700
11902,BMW,Z4,2016,premium unleaded (required),335.0,6.0,AUTOMATED_MANUAL,rear wheel drive,2.0,"Luxury,High-Performance",Compact,Convertible,24,17,3916,66350
11903,BMW,Z8,2001,premium unleaded (required),394.0,8.0,MANUAL,rear wheel drive,2.0,"Exotic,Luxury,High-Performance",Compact,Convertible,19,12,3916,128000
