# Warning - This Notebook Will Be Like #1 - Very Heavy on Syntax, Until the End

### If you get bored, just skip to the end and do the problems!

<img src="https://media1.tenor.com/m/iHgOa-519TkAAAAC/angry-angry-panda.gif" width="690" height="476.1" alt="panda go smash haha" fetchpriority="high" style="max-width: 690px; background-color: rgb(84, 90, 82);">

# Numpy

Numpy (Numerical Python) is a library/package which makes operating on lists/arrays/matrices much faster because its core is written in the C programming language (under the hood), so the heavy lifting happens in fast compiled code instead of slow Python loops. This makes it awesome for processing a lot of data.

Let's see this below in a simple example where we just want to multiply each element in a list by 2.

In [33]:
# Add in our testing notebook as usual
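# (the raw file URL is used so that wget downloads the Python file itself, not the GitHub web page)
!wget https://raw.githubusercontent.com/jakesanghavi/FraudedPythonTeacher/main/testing_files/testing3.py -P testing_files/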
%run testing_files/testing3.py

In [34]:
'''
Using the np alias is the standard way to import numpy
'''
import numpy as np
import time

# Let's make a list in standard Python vs. a list in NumPy
size = 1_000_000

# Python list version
py_list = list(range(size))
start = time.time()
py_result = [x * 2 for x in py_list]
end = time.time()
diff_p = end - start
print(f"Python list took: {diff_p:.5f} seconds")

# NumPy version
np_array = np.arange(size)
start = time.time()
np_result = np_array * 2
end = time.time()
diff_n = end - start
print(f"NumPy array took: {diff_n:.5f} seconds")

print(f"NumPy is over {int(diff_p/diff_n)}x faster!!!")

Python list took: 0.04329 seconds
NumPy array took: 0.00140 seconds
NumPy is over 31x faster!!!


# Numpy - Major Changes

#### Make a numpy array (rather than python list): `arr = np.array([1, 2, 3, 4])`
#### Arrange numbers evenly: `np.arange(0, 10, 2) --> [0 2 4 6 8]` OR `np.linspace(0, 1, 5) --> [0, 0.25, 0.5, 0.75, 1]`

`np.arange` takes arguments as `(start, end, space)` to generate numbers in the range `[start, end)` spaced apart by `space` in value.

`np.linspace` takes arguments as `(start, end, count)` to generate `count` evenly spaced numbers in the range `[start, end]` (unlike `np.arange`, the end value is included by default).
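
To see the difference between the two in practice, here is a small sketch (the variable names `stepped` and `spaced` are just made up for illustration). The thing to watch is the end value: `np.arange` stops before it, while `np.linspace` includes it by default.

```python
import numpy as np

# arange: start at 0, stop BEFORE 1, step by 0.25
stepped = np.arange(0, 1, 0.25)
print(stepped)   # values: 0, 0.25, 0.5, 0.75

# linspace: 5 evenly spaced values from 0 to 1, INCLUDING 1
spaced = np.linspace(0, 1, 5)
print(spaced)    # values: 0, 0.25, 0.5, 0.75, 1
```

A rough rule of thumb: reach for `np.arange` when you know the spacing you want, and `np.linspace` when you know how many numbers you want.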

### Basic Math Operations
Unlike python lists, you can do numerical operations (+, -, *, /) straight on the array! ex. `np.array([1, 2, 3, 4]) + 4 --> [5 6 7 8]`

You can also do other nice operations, like `sum()`, `mean()`, `std()` (standard deviation), and more! ex. `np.mean([1, 2, 3, 4]) --> 2.5`
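
Here is a quick sketch of those operations on a made-up array (`prices` is hypothetical example data, not something from the notebook). It also shows how a plain Python list reacts to `*`, which is the main reason arrays feel so different.

```python
import numpy as np

prices = np.array([10, 20, 30, 40])   # hypothetical example data

doubled = prices * 2          # every element is multiplied by 2
discounted = prices - 5       # every element has 5 subtracted

print(doubled)                # [20 40 60 80]
print(discounted)             # [ 5 15 25 35]
print(prices.sum())           # 100
print(prices.mean())          # 25.0

# A plain Python list behaves differently: * repeats the list instead
print([10, 20] * 2)           # [10, 20, 10, 20]
```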

### Filter Array

`mask = arr > 5` --> This makes a numpy array of the same length as `arr` with values of True and False based on the condition

`filtered_arr = arr[mask]` --> This extracts only the entries at indices where the mask is True
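
The two filtering steps can be sketched on a tiny made-up array like this (`temps` and `warm` are hypothetical names chosen for the example):

```python
import numpy as np

temps = np.array([3, 8, 5, 12, 1, 7])   # hypothetical example data

mask = temps > 5          # one True/False per element
print(mask)               # [False  True False  True False  True]

warm = temps[mask]        # keep only the elements where mask is True
print(warm)               # [ 8 12  7]

# The two steps are often combined into one line
print(temps[temps > 5])   # [ 8 12  7]
```

Note that the original `temps` array is left alone; the filter gives you a new, shorter array.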

# Numpy is Super Versatile!

I won't go through everything it can do, just know that it is a very useful tool for making things faster/easier/more concise when working with any type of list/matrix-like data. 

**This is helpful to know for the next part, Pandas!**  
**But first, try out some practice problems.**

In [3]:
### Use numpy to generate all even numbers from 1-1,000, as a numpy array!

even_nums = '...'

In [4]:
# Testing time!

test_even_nums(even_nums)

Hmm it looks like this isn't quite right. Try again!


In [5]:
# Generate an array of 500 random integers using numpy (hint: look up np.random.randint)
# Then, add 5 to all array values, and find the mean and standard deviation
# (save these as random_array_mean and random_array_std so the test can find them)
# THIS SHOULD ONLY TAKE 4 LINES OF CODE

random_array = '...'

In [6]:
# Testing time

test_random_stats(random_array, random_array_mean, random_array_std)

NameError: name 'random_array_mean' is not defined

In [9]:
### Given a numpy array of student test scores, filter out all scores below 60
### After this, we will move on to Pandas!

# Random scores are generated for you
test_scores = generate_random_test_scores(1000)

filtered_scores = '...'

In [10]:
# Testing time!

test_score_filter(test_scores, filtered_scores)

Hmm it looks like this isn't quite right. Try again!


<hr style="border-top: 3px solid #bbb;">

# Pandas 🐼 🐼 🐼

Pandas is basically Excel for cool kids. It's Python's most popular library for working with tabular data, which is one of the most common forms data comes in -- many kinds of data can be put into table form.

Pandas tables (like any table) have rows and columns. The columns have string names and the rows have integer index numbers. Let's look at a Pandas table (called a DataFrame) below:

In [35]:
'''
Like numpy, pandas also uses an alias when it is imported
'''
import pandas as pd

data = {
    'Name': ['Tony', 'Steve', 'Natasha'],
    'Role': ['Iron Man', 'Captain America', 'Black Widow'],
    'Age': [48, 105, 35]
}

df = pd.DataFrame(data)


'''
The above is also equivalent to this:
'''
# (when passing a list of lists, each inner list is one ROW of the table)
# data = [['Tony', 'Iron Man', 48],
#         ['Steve', 'Captain America', 105],
#         ['Natasha', 'Black Widow', 35]]

# colnames = ['Name', 'Role', 'Age']
# df= pd.DataFrame(data, columns=colnames)


df

| | Name | Role | Age |
|---|---|---|---|
| 0 | Tony | Iron Man | 48 |
| 1 | Steve | Captain America | 105 |
| 2 | Natasha | Black Widow | 35 |


# DataFrames are awesome!

If you couldn't tell from the above, DataFrames basically build on the fantastic dictionary data structure. This lets you look things up in the table, and change things in the table very quickly if needed. You can also pick out certain columns and rows very easily.

Let's see how that's done:

In [None]:
# Let's print out just the Name column
# You access columns in Pandas like you would dictionary keys

df['Name']

In [None]:
# Let's print out just the first row using .iloc
# iloc picks a row by its integer position (0 is the first row)

df.iloc[0]

In [None]:
# To change something at a specific position, reference it with .loc[row_label, column]
# (with the default index, the row label is just the row number)
df.loc[0, 'Name'] = 'Neha'

df

In [None]:
# Basic mathematical operations on Pandas columns are super easy, just like in Numpy!
# This is because the values in a column are very similar to Numpy arrays!

df['Age'] = df['Age'] + 5

df

# SQL ... Yeah the 5th Circle of Hell Sadly is a Bit Useful

yea yea ok obviously everyone hates this part

I lied we won't actually be doing SQL but it's like highkey the same thing

Anyways doing filters and aggregations on tables (basically the main point of SQL too) is often really useful.
Let's see how we can do this at a basic level.

### Important Note

Most Pandas operations are not "in-place" by default: they hand back a new result instead of changing the original. This means that if you say something like:
`df[df['Name'] == 'Neha']`

Nothing actually happens to the `df` variable. You have to say `df = df[df['Name'] == 'Neha']`

For the purposes of printing I will use the former, but just know that's not changing the actual df.
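
If you want to convince yourself of this, here is a minimal sketch on a small made-up table (`heroes` and `only_neha` are hypothetical names used just for this example):

```python
import pandas as pd

# Small hypothetical table, just for illustration
heroes = pd.DataFrame({
    'Name': ['Neha', 'Steve', 'Natasha'],
    'Age': [53, 110, 40]
})

# Filtering returns a NEW DataFrame; heroes itself is untouched
heroes[heroes['Name'] == 'Neha']
print(len(heroes))      # 3

# Save the result to a variable to actually keep it
only_neha = heroes[heroes['Name'] == 'Neha']
print(len(only_neha))   # 1
print(len(heroes))      # 3 (still unchanged)
```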

In [None]:
# The below syntax looks a little silly. Let's explain how it works
# Basically, the part inside of the outermost set of brackets returns a 1D-array of True/False values,
# True where the condition is met, and False where it is not.
# Then, the df[] on the outside will return only the rows where that is True

df[df['Role'] == 'Iron Man']

In [None]:
# Combining conditions requires you to wrap each condition in parentheses
# Note: unlike base Python, which uses the words 'and' and 'or',
# Pandas uses the so-called bitwise operators: & for 'and', | for 'or'

df[(df['Role'] == 'Iron Man') | (df['Age'] > 100)]

## Aggregations

In [None]:
# We'll load up a new DF to showcase this
data = {
    'Employee': ['Alice', 'Bob', 'Charlie', 'Diana', 'Ethan', 'Fiona', 'George', 'Hannah'],
    'Department': ['HR', 'Engineering', 'HR', 'Engineering', 'Marketing', 'Engineering', 'Marketing', 'HR'],
    'Salary': [60000, 80000, 65000, 85000, 55000, 87000, 53000, 62000],
    'YearsAtCompany': [2, 5, 3, 6, 1, 7, 2, 4]
}

df = pd.DataFrame(data)

# Pandas groupby is used to "aggregate" data. Let's explain
# Groupby takes in 1+ columns to "group" on. This basically makes a bunch of smaller DataFrames,
# where all the rows in each small DataFrame share the same values in the grouped columns.
# ex. for the below, one DataFrame will hold just HR, one will hold Engineering, and one will hold Marketing

# Then, on each of these mini DataFrames, we will do some sort of overall calculation on some of the other columns
# Below, we get the average salary and total years at the company, BY each department
# I hope that makes sense!
df.groupby(['Department'], as_index=False).agg({'Salary': 'mean', 'YearsAtCompany': 'sum'})
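
If the "mini DataFrames" idea still feels abstract, the sketch below peeks inside a groupby on a tiny made-up table (`staff` is hypothetical data, smaller than the one above). It loops over the groups, then checks that computing one group's average by hand matches what `groupby` + `agg` gives.

```python
import pandas as pd

# Tiny hypothetical table, just for illustration
staff = pd.DataFrame({
    'Department': ['HR', 'Engineering', 'HR', 'Engineering'],
    'Salary': [60000, 80000, 65000, 85000]
})

# Looping over a groupby gives one (group name, mini DataFrame) pair per group
for dept_name, mini_df in staff.groupby('Department'):
    print(dept_name, '->', mini_df['Salary'].tolist())

# Doing the aggregation "by hand" on one mini DataFrame...
hr_mean = staff[staff['Department'] == 'HR']['Salary'].mean()
print(hr_mean)   # 62500.0

# ...matches what groupby + agg computes for that group
summary = staff.groupby('Department', as_index=False).agg({'Salary': 'mean'})
print(summary)
```

So `groupby` is mostly a shortcut: it does the filter-then-calculate step for every group at once.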

# Miscellaneous Cool Functions

Here are some other functions you might like to use:

`df.sort_values('Score', ascending=False)` --> Sorts the rows by a column (ascending is True by default)

`df[df['Name'].str.contains('Jake')]` --> Adding `.str` after a column name lets you use basic string operations on the whole column at once, in Numpy-like fashion

`df.isnull()` --> Checks for null (missing) data

`df.fillna(0)` --> Can fill missing data with a particular value
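
Here is a small sketch that tries all four on a made-up table with one missing value (`scores` and its contents are hypothetical, chosen just to make each function do something visible):

```python
import numpy as np
import pandas as pd

# Tiny hypothetical table with one missing score (np.nan)
scores = pd.DataFrame({
    'Name': ['Jake S', 'Neha', 'Jakob'],
    'Score': [88.0, np.nan, 95.0]
})

# Highest score first; rows with a missing score go last by default
by_score = scores.sort_values('Score', ascending=False)
# Rows whose Name contains the text 'Jake'
jakes = scores[scores['Name'].str.contains('Jake')]
# True where the score is missing
missing = scores['Score'].isnull()
# Missing score replaced with 0
filled = scores.fillna(0)

print(by_score['Name'].tolist())   # ['Jakob', 'Jake S', 'Neha']
print(jakes['Name'].tolist())      # ['Jake S']
print(missing.tolist())            # [False, True, False]
print(filled['Score'].tolist())    # [88.0, 0.0, 95.0]
```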

Like Numpy, Pandas is a huge library with a ton of functions. So, there is way more but this was already enough yap. Why don't you try to use these libraries with some basic problems?

In [23]:
### Make a Pandas DataFrame from this dictionary
### Yes yes it's ChatGPT generated so the numbers and genres are a bit off
#### but you get the idea
### Not testing anything for this part it'll be obvious if the df is made

data = {
    'Name': [
        'Taylor Swift', 'Beyoncé', 'Bruno Mars', 'Adele', 'Ed Sheeran',
        'Billie Eilish', 'The Weeknd', 'Lady Gaga', 'Drake', 'Dua Lipa',
        'Harry Styles', 'Kendrick Lamar', 'Rihanna', 'Doja Cat', 'Justin Bieber',
        'Lizzo', 'Ariana Grande', 'Sam Smith', 'Olivia Rodrigo', 'SZA'
    ],
    'Genre': [
        'Pop', 'R&B', 'Pop', 'Soul', 'Pop',
        'Alt Pop', 'R&B', 'Pop', 'Hip Hop', 'Pop',
        'Pop', 'Hip Hop', 'R&B', 'Pop', 'Pop',
        'Soul', 'Pop', 'Pop', 'Pop', 'R&B'
    ],
    'Age': [
        34, 42, 38, 35, 33,
        22, 34, 38, 37, 28,
        30, 36, 36, 28, 30,
        35, 30, 31, 21, 34
    ],
    'Grammys': [
        12, 32, 15, 16, 4,
        7, 4, 13, 5, 3,
        2, 17, 9, 1, 2,
        4, 2, 5, 3, 4
    ]
}

musician_df = pd.DataFrame(data)

In [30]:
### Get the row that just has Lady Gaga's info

gaga_row = '...'

In [None]:
# Testing time!

test_gaga_row(musician_df, gaga_row)

In [32]:
### Get the average age AND total number of Grammys for each Genre! (hint: use groupby!)
### I won't test this but in the cell below is what it should look like


| | Genre | Age | Grammys |
|---|---|---|---|
| 0 | Alt Pop | 22.0 | 7 |
| 1 | Hip Hop | 36.5 | 22 |
| 2 | Pop | 31.0 | 62 |
| 3 | R&B | 36.5 | 49 |
| 4 | Soul | 35.0 | 20 |


In [36]:
test_groupby_fun(musician_df)

| | Genre | Age | Grammys |
|---|---|---|---|
| 0 | Alt Pop | 22.0 | 7 |
| 1 | Hip Hop | 36.5 | 22 |
| 2 | Pop | 31.0 | 62 |
| 3 | R&B | 36.5 | 49 |
| 4 | Soul | 35.0 | 20 |


# Awesome work! Time to move on to the last notebook!