## Part 1

### Introduction to NumPy

#### What is NumPy

NumPy is a Python library for numerical computing. It provides large, multi-dimensional arrays and matrices, together with a collection of mathematical functions that operate on them efficiently.

#### Why is NumPy important in Data Science?

NumPy is essential in Data Science because it handles large data sets efficiently. It provides high-performance multidimensional arrays and tools for working with them, making it indispensable for tasks like data manipulation, analysis, and machine learning.

In [3]:
# Installation and setup 

!pip install numpy




In [14]:
# Import the numpy library
import numpy as np

# Create an array from a list
arr1 = np.array([1, 2, 3, 4, 5])
print("Array from list:\n", arr1)

# Create an array of zeros with 5 elements
arr2 = np.zeros(5)
print("Array of zeros:\n", arr2)

# Create a 3x3 array of ones
arr3 = np.ones((3, 3))
print("3x3 Array of ones:\n", arr3)

# Create an array with values ranging from 0 to 10 with a step of 12 (only the start value will be included as step is greater than the range)
arr4 = np.arange(0, 10, 12)
print("Array with values ranging from 0 to 10 with a step of 12:\n", arr4)

# Create a 3x3 array of random values between 0 and 1
arr5 = np.random.rand(3, 3)
print("3x3 Array of random values:\n", arr5)


Array from list:
 [1 2 3 4 5]
Array of zeros:
 [0. 0. 0. 0. 0.]
3x3 Array of ones:
 [[1. 1. 1.]
 [1. 1. 1.]
 [1. 1. 1.]]
Array with values ranging from 0 to 10 with a step of 12:
 [0]
3x3 Array of random values:
 [[0.48009881 0.17751879 0.09635787]
 [0.61068463 0.97508596 0.8301523 ]
 [0.75388856 0.61803462 0.40799539]]
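As a complement to the creation routines above, here is a minimal sketch of two related tools: `np.linspace`, which takes a number of points instead of a step, and a seeded `np.random.default_rng` generator, which makes random values repeatable. The variable names and the seed value 42 are illustrative assumptions, not part of the original notebook.

```python
import numpy as np

# Five evenly spaced values from 0 to 1; both endpoints are included
lin = np.linspace(0, 1, 5)
print(lin)

# A step that fits inside the range yields several values; the stop value is excluded
evens = np.arange(0, 10, 2)
print(evens)

# A seeded generator returns the same values each time it is re-created with that seed
rng = np.random.default_rng(42)
rand = rng.random((3, 3))
print(rand.shape)
```

Seeding is useful whenever a notebook has to produce the same "random" numbers on every run, for example in tests or tutorials.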


In [15]:
# Array attributes: shape, size, dtype

# Shape of the array
print(arr1.shape)  # (5,): a 1-D array with 5 elements (no separate rows and columns)

# Size of the array
print(arr1.size)  # number of elements

# Data type of the array
print(arr1.dtype)

(5,)
5
int32
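The default integer dtype depends on the platform and the NumPy version (the `int32` shown above is what some setups report, others report `int64`). A small sketch of how to pin the dtype explicitly, and of how the attributes look for a 2-D array, follows; the variable names are assumptions made for illustration.

```python
import numpy as np

# Default integer dtype: platform and version dependent
arr_default = np.array([1, 2, 3, 4, 5])
print(arr_default.dtype)

# Request a dtype explicitly so results do not vary across machines
arr_i64 = np.array([1, 2, 3, 4, 5], dtype=np.int64)
print(arr_i64.dtype)

# astype returns a converted copy; the original array is unchanged
arr_f32 = arr_i64.astype(np.float32)
print(arr_f32.dtype)

# For a 2-D array, shape is (rows, columns)
matrix = np.ones((3, 3))
print(matrix.ndim, matrix.shape, matrix.size)
```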


In [16]:
# Indexing and slicing arrays

# Accessing elements
print(arr1[0])

# Slicing 
print(arr1[1:4])

1
[2 3 4]
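Beyond the basic cases above, slicing also supports negative indices, steps, and several axes at once. The sketch below uses its own hypothetical arrays (`arr`, `grid`) and also shows that a basic slice is a view on the original data rather than a copy.

```python
import numpy as np

arr = np.array([1, 2, 3, 4, 5])

print(arr[-1])    # last element
print(arr[::2])   # every second element
print(arr[::-1])  # reversed

grid = np.arange(9).reshape(3, 3)
print(grid[1, 2])    # row 1, column 2
print(grid[:, 0])    # first column
print(grid[:2, 1:])  # first two rows, last two columns

# A basic slice is a view: writing through it changes the original array
view = arr[1:4]
view[0] = 99
print(arr)
```

Call `.copy()` on a slice when an independent array is needed.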


In [19]:
# Array operations: arithmetic, aggregation, broadcasting

# Arithmetic operations
arr6=arr1+arr2
print(arr6)

# Aggregation function

print(np.sum(arr1))

# Broadcasting

arr7=arr1*2
print(arr7)

[1. 2. 3. 4. 5.]
15
[ 2  4  6  8 10]
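Aggregations such as `np.sum` can also work along one axis of a multi-dimensional array. A short sketch, assuming a hypothetical 2x3 array `m`:

```python
import numpy as np

m = np.arange(1, 7).reshape(2, 3)  # [[1 2 3], [4 5 6]]

print(m.sum())         # sum of all elements
print(m.sum(axis=0))   # column sums (collapses the rows)
print(m.sum(axis=1))   # row sums (collapses the columns)
print(m.mean(axis=1))  # mean of each row
print(m.max(axis=0))   # maximum of each column
```

The `axis` argument names the dimension that disappears from the result, which is why `axis=0` yields one value per column.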


In [23]:
# Reshaping arrays

# Reshape array

arr8=np.arange(9).reshape(3,3)
print(arr8)

[[0 1 2]
 [3 4 5]
 [6 7 8]]
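`reshape` only works when the new shape holds exactly the same number of elements. The sketch below, with illustrative names, uses `-1` to let NumPy infer one dimension and shows the error raised for an incompatible shape.

```python
import numpy as np

a = np.arange(12)

# -1 asks NumPy to infer that dimension: 12 elements / 3 rows = 4 columns
b = a.reshape(3, -1)
print(b.shape)

# reshape(-1) flattens back to one dimension
c = b.reshape(-1)
print(c.shape)

# 12 elements cannot fill a 5x3 array, so this raises ValueError
try:
    a.reshape(5, 3)
    reshape_failed = False
except ValueError as err:
    reshape_failed = True
    print("Cannot reshape:", err)
```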


In [27]:
# Stacking and splitting arrays



# Stacking arrays vertically
arr9 = np.vstack((arr8, arr8))  # Stack arr8 on top of itself
print("Stacked vertically (vstack):\n", arr9)

# Splitting arrays
arr10, arr11 = np.split(arr9, 2)  # Split arr9 into 2 equal parts
print("\nSplit arrays:\n", arr10, "\n", arr11)


Stacked vertically (vstack):
 [[0 1 2]
 [3 4 5]
 [6 7 8]
 [0 1 2]
 [3 4 5]
 [6 7 8]]

Split arrays:
 [[0 1 2]
 [3 4 5]
 [6 7 8]] 
 [[0 1 2]
 [3 4 5]
 [6 7 8]]
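For comparison with `np.vstack` and `np.split`, here is a hedged sketch of horizontal stacking, the more general `np.concatenate`, and `np.array_split`, which tolerates uneven splits. The array names are assumptions for illustration.

```python
import numpy as np

base = np.arange(9).reshape(3, 3)

# Stack side by side: the column count doubles
wide = np.hstack((base, base))
print(wide.shape)

# concatenate with axis=0 is equivalent to vstack for 2-D arrays
tall = np.concatenate((base, base), axis=0)
print(np.array_equal(tall, np.vstack((base, base))))

# np.split requires equal parts; array_split allows a remainder
parts = np.array_split(np.arange(7), 3)
print([p.tolist() for p in parts])
```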


In [31]:
# Transposing arrays: .T swaps rows and columns

arr12=arr8.T
print("Array 8 Transpose\n",arr12)
print("\nArray 8\n",arr8)

Array 8 Transpose
 [[0 3 6]
 [1 4 7]
 [2 5 8]]

Array 8
 [[0 1 2]
 [3 4 5]
 [6 7 8]]


In [33]:
# Universal functions (ufuncs): functions such as np.sin that operate element-wise

arr13=np.sin(arr1)
print("sin of array 1\n", arr13)
print("\narray 1\n", arr1)

sin of array 1
 [ 0.84147098  0.90929743  0.14112001 -0.7568025  -0.95892427]

array 1
 [1 2 3 4 5]


In [38]:
# Fancy indexing
indices=np.array([0,2,4])

print(arr1[indices])

[1 3 5]


In [48]:
# Boolean indexing

# Create an array from a list
arr1 = np.array([1, 2, 3, 4, 5])

# Boolean indexing
bool_arr = arr1 > 2  # Create a boolean array where each element indicates if the corresponding element in arr1 is greater than 2
print("Boolean array:", bool_arr)

# Print elements in arr1 that are greater than 2 using the boolean array
print("Elements in arr1 greater than 2:", arr1[bool_arr])

Boolean array: [False False  True  True  True]
Elements in arr1 greater than 2: [3 4 5]
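Boolean masks can be combined before indexing. The sketch below uses `&`, `|`, and `~` (each condition needs its own parentheses) and `np.where` for conditional replacement; the variable names are illustrative.

```python
import numpy as np

arr = np.array([1, 2, 3, 4, 5])

# AND: strictly between 1 and 5
middle = arr[(arr > 1) & (arr < 5)]
print(middle)

# OR: odd values or values above 3
odd_or_big = arr[(arr % 2 == 1) | (arr > 3)]
print(odd_or_big)

# NOT: invert a mask
not_small = arr[~(arr < 3)]
print(not_small)

# np.where picks from two alternatives element by element: cap values at 3
capped = np.where(arr > 3, 3, arr)
print(capped)
```

Python's `and`/`or` keywords do not work on arrays here; the element-wise operators are required.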


In [49]:
# Vectorized operation

arr14=arr1+10
print(arr14)

[11 12 13 14 15]


In [50]:
# Broadcasting
arr15=arr1+np.array([
    [10],
    [20],
    [30],
    [40],
    [50]
])

print(arr15)

[[11 12 13 14 15]
 [21 22 23 24 25]
 [31 32 33 34 35]
 [41 42 43 44 45]
 [51 52 53 54 55]]
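The result above follows from the broadcasting rule: shapes are compared from the last dimension backwards, and two dimensions are compatible when they are equal or one of them is 1. A minimal sketch with hypothetical arrays `row` and `col`:

```python
import numpy as np

row = np.array([1, 2, 3, 4, 5])             # shape (5,)
col = np.array([10, 20, 30])[:, np.newaxis]  # shape (3, 1)

# (5,) and (3, 1) broadcast to (3, 5)
table = row + col
print(table.shape)
print(table)

# Shapes (5,) and (3,) are incompatible: neither dimension is 1
try:
    row + np.array([1, 2, 3])
    broadcast_failed = False
except ValueError as err:
    broadcast_failed = True
    print("Cannot broadcast:", err)
```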


## Part 2

### Introduction to Pandas

#### Pandas

Pandas is a powerful data manipulation and analysis library built on top of NumPy. It provides high-level data structures, such as Series and DataFrame, designed to handle structured data efficiently.

#### Series and DataFrame Objects

Pandas has two main data structures:

1. **Series**: A one-dimensional array-like object that can hold any data type.
2. **DataFrame**: A two-dimensional labeled data structure with columns of potentially different data types.


#### Importance

Pandas is essential in data analysis because it simplifies common data manipulation tasks such as cleaning, transformation, and exploration. It provides intuitive and flexible tools for working with structured data.

In [52]:
# Installation and setup
!pip install pandas



In [56]:
# Creating Series objects

import pandas as pd

# Create series from a list

s1=pd.Series([1,2,3,4,5])

print("Series 1:\n",s1)

#  Create series from an array

s2=pd.Series(np.array([1,2,3,4,5]))
print("\nSeries 2:\n",s2)

# Create series from a dictionary

s3=pd.Series({"a":1,
              "b":2,
              "c":3})

print("\nSeries 3:\n",s3)




Series 1:
 0    1
1    2
2    3
3    4
4    5
dtype: int64

Series 2:
 0    1
1    2
2    3
3    4
4    5
dtype: int32

Series 3:
 a    1
b    2
c    3
dtype: int64


In [58]:
# Indexing and slicing Series

# Accessing elements by label
print(s3["a"])  # Prints the element with label "a"

# Accessing elements by position using .iloc (to avoid the FutureWarning)
print(s3.iloc[0])  # Prints the element at position 0 (first element)

# Slicing
print(s3[:2])  # Prints the first two elements of the Series

1
1
a    1
b    2
dtype: int64
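One detail worth a short sketch: label-based slices with `.loc` include the end label, while position-based slices with `.iloc` exclude the end position, as in plain Python. The Series `s` below is a stand-in for `s3`.

```python
import pandas as pd

s = pd.Series({"a": 1, "b": 2, "c": 3})

# Label slice: both "a" and "b" are included
by_label = s.loc["a":"b"]
print(by_label)

# Position slice: position 2 is excluded
by_position = s.iloc[0:2]
print(by_position)

# A list of labels selects arbitrary entries
picked = s.loc[["a", "c"]]
print(picked)

# A boolean mask filters by value
large = s[s > 1]
print(large)
```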


In [61]:
# Operations on Series

# Arithmetic operations
s4=s1+s2
print("Arithmetic operations\n", s4)
# Element-wise operations
s5=s1*2
print("\nElement-wise operations\n", s5)

# Aggregation function

print("\nAggregation function\n", s1.sum())

Arithmetic operations
 0     2
1     4
2     6
3     8
4    10
dtype: int64

Element-wise operations
 0     2
1     4
2     6
3     8
4    10
dtype: int64

Aggregation function
 15


In [62]:
# Handling missing data
# Note: s1 contains no missing values, so dropna() and fillna() return it unchanged here

# Drop missing values

s6 = s1.dropna()

# Fill missing values

s7 = s1.fillna(0)

# Check for missing values

print(s1.isnull())

0    False
1    False
2    False
3    False
4    False
dtype: bool
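Since `s1` has no gaps, the calls above leave it unchanged. The sketch below applies the same methods to a hypothetical Series that does contain `NaN` values, so their effect is visible.

```python
import numpy as np
import pandas as pd

# Hypothetical Series with two missing entries
s_missing = pd.Series([1.0, np.nan, 3.0, np.nan, 5.0])

# Count the missing values
n_missing = s_missing.isnull().sum()
print(n_missing)

# Drop them: the remaining values keep their original index labels
dropped = s_missing.dropna()
print(dropped)

# Fill them with a constant, or with a statistic such as the mean
filled_zero = s_missing.fillna(0)
filled_mean = s_missing.fillna(s_missing.mean())
print(filled_mean)
```

`mean()` skips `NaN` by default, so the fill value here is the mean of the three observed numbers.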


In [64]:
#Creating DataFrames
# Create DataFrame from a dictionary

data = {'Name': ['Alice', 'Bob', 'Charlie'],
 'Age': [25, 30, 35]}

df1 = pd.DataFrame(data)

# Create DataFrame from a list of lists

data = [['Alice', 25], ['Bob', 30], ['Charlie', 35]]

df2 = pd.DataFrame(data, columns=['Name', 'Age'])

In [65]:
#Indexing and slicing DataFrames

# Label-based indexing

print(df1.loc[0, 'Name'])

# Position-based indexing

print(df1.iloc[0, 0])

# Slicing

print(df1[:2])

Alice
Alice
    Name  Age
0  Alice   25
1    Bob   30


In [66]:
#Basic operations

# Sorting

df1_sorted = df1.sort_values(by='Age')
print(df1_sorted)

# Filtering
df1_filtered = df1[df1['Age'] > 30]
print(df1_filtered)

# Selecting columns
names = df1['Name']
print(names)


      Name  Age
0    Alice   25
1      Bob   30
2  Charlie   35
      Name  Age
2  Charlie   35
0      Alice
1        Bob
2    Charlie
Name: Name, dtype: object


In [67]:
#Data manipulation

# Adding a column
df1['Gender'] = ['Female', 'Male', 'Male']

# Deleting a column
del df1['Gender']

# Updating a column
df1['Age'] = df1['Age'] + 1

In [68]:
# Data Cleaning and Preparation

# Handling missing values
# (df1 has none, so dropna() and fillna() return it unchanged here)
# Drop missing values

df1_cleaned = df1.dropna()
# Fill missing values

df1_filled = df1.fillna(0)
# Check for missing values

print(df1.isnull().any())

Name    False
Age     False
dtype: bool
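To see these methods act on real gaps, here is a sketch with a hypothetical DataFrame `people` (not part of the original notebook) that has one missing number and one missing string. Filling with a per-column dictionary is one reasonable choice among several.

```python
import numpy as np
import pandas as pd

# Hypothetical data with gaps
people = pd.DataFrame({
    "Name": ["Alice", "Bob", "Charlie", "Dana"],
    "Age": [25.0, np.nan, 35.0, 30.0],
    "City": ["Paris", "Lima", None, "Oslo"],
})

# Missing values per column
missing_counts = people.isnull().sum()
print(missing_counts)

# Keep only rows with no missing values
dropped = people.dropna()
print(dropped)

# Fill each column with a value that suits it
filled = people.fillna({"Age": people["Age"].mean(), "City": "Unknown"})
print(filled)
```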


In [70]:
# Data normalization (numerical values only)

# Min-Max normalization
# Min-Max normalization scales the data to a range [0, 1]
df1_normalized = (df1['Age'] - df1['Age'].min()) / (df1['Age'].max() - df1['Age'].min())
print("\nMin-Max Normalized DataFrame (Age):\n", df1_normalized)

# Standardization
# Standardization rescales the data to a mean of 0 and a standard deviation of 1
# (pandas .std() uses the sample standard deviation, ddof=1)
df1_standardized = (df1['Age'] - df1['Age'].mean()) / df1['Age'].std()
print("\nStandardized DataFrame (Age):\n", df1_standardized)


Min-Max Normalized DataFrame (Age):
 0    0.0
1    0.5
2    1.0
Name: Age, dtype: float64

Standardized DataFrame (Age):
 0   -1.0
1    0.0
2    1.0
Name: Age, dtype: float64
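The standardized values above depend on which standard deviation is used. pandas `Series.std()` defaults to the sample version (`ddof=1`), whereas `np.std` defaults to the population version (`ddof=0`). A sketch with the same three ages, where `ages` is an illustrative stand-in for `df1['Age']`:

```python
import numpy as np
import pandas as pd

ages = pd.Series([26, 31, 36])

# Sample standard deviation (pandas default, ddof=1)
z_sample = (ages - ages.mean()) / ages.std()
print(z_sample.tolist())

# Population standard deviation (ddof=0), which matches np.std's default
z_population = (ages - ages.mean()) / ages.std(ddof=0)
print(z_population.round(4).tolist())

# Min-max scaling is unaffected by this choice
min_max = (ages - ages.min()) / (ages.max() - ages.min())
print(min_max.tolist())
```

Either convention is valid; what matters is using the same one consistently, for example when comparing against another library's scaler.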


In [74]:
# Log transformation
# Apply log transformation to the 'Age' column
df1['Log_Age'] = np.log(df1['Age'])
print("\nDataFrame after log transformation (Age):\n", df1)

# Encoding categorical variables
# Use pd.get_dummies to encode the 'Gender' column
df1['Gender'] = ['Female', 'Male', 'Male']
df1_encoded = pd.get_dummies(df1, columns=['Gender'])
print("\nDataFrame after encoding 'Gender':\n", df1_encoded)


DataFrame after log transformation (Age):
       Name  Age   Log_Age
0    Alice   26  3.258097
1      Bob   31  3.433987
2  Charlie   36  3.583519

DataFrame after encoding 'Gender':
       Name  Age   Log_Age  Gender_Female  Gender_Male
0    Alice   26  3.258097           True        False
1      Bob   31  3.433987          False         True
2  Charlie   36  3.583519          False         True


In [78]:
# Data aggregation and grouping
# Group by gender and calculate the mean age of each group
# (grouping by 'Age' itself would just return each age unchanged)
gender_groups = df1.groupby('Gender')
mean_age = gender_groups['Age'].mean()
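Grouping is most informative when the grouping column is categorical and a different column is aggregated. The sketch below rebuilds a small hypothetical DataFrame `people` with the same names, ages, and genders used earlier, then computes one and several statistics per group.

```python
import pandas as pd

# Stand-in for df1 with the 'Gender' column added earlier
people = pd.DataFrame({
    "Name": ["Alice", "Bob", "Charlie"],
    "Age": [26, 31, 36],
    "Gender": ["Female", "Male", "Male"],
})

# One statistic per group
mean_age_by_gender = people.groupby("Gender")["Age"].mean()
print(mean_age_by_gender)

# Several statistics at once
summary = people.groupby("Gender")["Age"].agg(["mean", "min", "max", "count"])
print(summary)
```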

## Advanced Pandas

In [79]:
## Merging, Joining, and Concatenating DataFrames

# Concatenating DataFrames

import pandas as pd

# Create sample DataFrames

df1 = pd.DataFrame({'A': ['A0', 'A1', 'A2'],
 'B': ['B0', 'B1', 'B2']})

df2 = pd.DataFrame({'A': ['A3', 'A4', 'A5'],
 'B': ['B3', 'B4', 'B5']})

# Concatenate along rows

result_row = pd.concat([df1, df2])

# Concatenate along columns

result_col = pd.concat([df1, df2], axis=1)
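The cell above produces no output, so here is a sketch of what the two concatenations look like. Note that row-wise concatenation keeps the original index labels (so they repeat) unless `ignore_index=True` is passed. The names `df_a` and `df_b` are stand-ins for `df1` and `df2`.

```python
import pandas as pd

df_a = pd.DataFrame({"A": ["A0", "A1", "A2"], "B": ["B0", "B1", "B2"]})
df_b = pd.DataFrame({"A": ["A3", "A4", "A5"], "B": ["B3", "B4", "B5"]})

# Along rows: index labels 0, 1, 2 appear twice
rows = pd.concat([df_a, df_b])
print(rows.index.tolist())

# ignore_index=True renumbers the rows from 0
rows_reset = pd.concat([df_a, df_b], ignore_index=True)
print(rows_reset.index.tolist())

# Along columns: rows are aligned on the index, column names repeat
cols = pd.concat([df_a, df_b], axis=1)
print(cols.shape, cols.columns.tolist())
```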

In [81]:
#Merging DataFrames

# Create sample DataFrames

# Creating the 'left' DataFrame
left = pd.DataFrame({'key': ['K0', 'K1', 'K2'],
                     'value': ['V0', 'V1', 'V2']})
print("Left DataFrame:\n", left)

# Creating the 'right' DataFrame
right = pd.DataFrame({'key': ['K1', 'K2', 'K3'],
                      'value': ['V1', 'V2', 'V3']})
print("\nRight DataFrame:\n", right)

# Inner join

# Perform inner join on 'key' column
# Only the rows with matching keys in both DataFrames are included in the result
inner_join = pd.merge(left, right, on='key', how='inner')
print("\nInner Join:\n", inner_join)

# Left join

# Perform left join on 'key' column
# All rows from the left DataFrame are kept, along with any matching rows from the right DataFrame
# Left rows without a match get NaN in the columns that come from the right DataFrame
left_join = pd.merge(left, right, on='key', how='left')
print("\nLeft Join:\n", left_join)

# Right join

# Perform right join on 'key' column
# All rows from the right DataFrame are kept, along with any matching rows from the left DataFrame
# Right rows without a match get NaN in the columns that come from the left DataFrame
right_join = pd.merge(left, right, on='key', how='right')
print("\nRight Join:\n", right_join)

# Outer join

# Perform outer join on 'key' column
# All rows from both DataFrames are included
# Where a key is missing from one DataFrame, that DataFrame's columns are filled with NaN
outer_join = pd.merge(left, right, on='key', how='outer')
print("\nOuter Join:\n", outer_join)


Left DataFrame:
   key value
0  K0    V0
1  K1    V1
2  K2    V2

Right DataFrame:
   key value
0  K1    V1
1  K2    V2
2  K3    V3

Inner Join:
   key value_x value_y
0  K1      V1      V1
1  K2      V2      V2

Left Join:
   key value_x value_y
0  K0      V0     NaN
1  K1      V1      V1
2  K2      V2      V2

Right Join:
   key value_x value_y
0  K1      V1      V1
1  K2      V2      V2
2  K3     NaN      V3

Outer Join:
   key value_x value_y
0  K0      V0     NaN
1  K1      V1      V1
2  K2      V2      V2
3  K3     NaN      V3
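The `value_x` and `value_y` column names above are the default suffixes pandas adds when both inputs share a non-key column. A hedged sketch of two `pd.merge` options that make the result easier to read: `suffixes` to name the columns, and `indicator=True` to record where each row came from.

```python
import pandas as pd

left = pd.DataFrame({"key": ["K0", "K1", "K2"], "value": ["V0", "V1", "V2"]})
right = pd.DataFrame({"key": ["K1", "K2", "K3"], "value": ["V1", "V2", "V3"]})

merged = pd.merge(
    left, right, on="key", how="outer",
    suffixes=("_left", "_right"),  # replaces the default _x / _y
    indicator=True,                # adds a _merge column
).sort_values("key")

print(merged)
```

The `_merge` column takes the values `left_only`, `right_only`, or `both`, which is handy for auditing a join.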


In [83]:
#Joining DataFrames

# Create sample DataFrames

left = pd.DataFrame({'value1': [1, 2, 3]}, index=['a', 'b', 'c'])
right = pd.DataFrame({'value2': [4, 5, 6]}, index=['a', 'b', 'd'])

# Join based on index

join_df = left.join(right, how='inner')

print("Join based on index:\n", join_df)

Join based on index:
    value1  value2
a       1       4
b       2       5


### Technical-level questions and exercises

In [84]:
### Implementing Matrix Operations: Matrix Multiplication
import numpy as np
# Create two matrices
A = np.array([[1, 2], [3, 4]])
B = np.array([[5, 6], [7, 8]])

In [85]:
## Question: Multiply matrices A and B.
# Answer
C = np.dot(A, B)
print(C) # Output: [[19 22] [43 50]]

[[19 22]
 [43 50]]
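Matrix multiplication is easy to confuse with element-wise multiplication. The sketch below contrasts `np.dot`, the equivalent `@` operator, and `*` on the same two matrices.

```python
import numpy as np

A = np.array([[1, 2], [3, 4]])
B = np.array([[5, 6], [7, 8]])

# Matrix product: rows of A combined with columns of B
matrix_product = A @ B
print(matrix_product)

# For 2-D arrays, @ and np.dot give the same result
print(np.array_equal(matrix_product, np.dot(A, B)))

# * multiplies matching positions instead
elementwise = A * B
print(elementwise)

# Matrix multiplication is not commutative in general
print(np.array_equal(A @ B, B @ A))
```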


In [86]:
## Question: Matrix Inversion - Find the inverse of matrix A.
# Answer
A_inv = np.linalg.inv(A)
print(A_inv) # Output: [[-2. 1.] [ 1.5 -0.5]]

[[-2.   1. ]
 [ 1.5 -0.5]]
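An inverse can be checked by multiplying it back against the original matrix, and it only exists when the determinant is non-zero. A minimal sketch; the `singular` matrix is a hypothetical example added for illustration.

```python
import numpy as np

A = np.array([[1, 2], [3, 4]])

# A non-zero determinant means the matrix is invertible
det = np.linalg.det(A)
print(det)

# A times its inverse should equal the identity, up to rounding error
A_inv = np.linalg.inv(A)
print(np.allclose(A @ A_inv, np.eye(2)))

# The second row is twice the first, so this matrix has no inverse
singular = np.array([[1, 2], [2, 4]])
try:
    np.linalg.inv(singular)
    inversion_failed = False
except np.linalg.LinAlgError as err:
    inversion_failed = True
    print("Not invertible:", err)
```

`np.allclose` is used instead of `==` because floating-point results are rarely exact.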


In [87]:
## Question: Calculate the eigenvalues and eigenvectors of matrix A.
# Answer
eigenvalues, eigenvectors = np.linalg.eig(A)
print("Eigenvalues:\n",eigenvalues) # Output: [-0.37228132  5.37228132]
print("\nEigenvectors:\n",eigenvectors)

Eigenvalues:
 [-0.37228132  5.37228132]

Eigenvectors:
 [[-0.82456484 -0.41597356]
 [ 0.56576746 -0.90937671]]
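In the result of `np.linalg.eig`, column `i` of the eigenvector matrix belongs to `eigenvalues[i]`, and the order of the eigenvalues is not guaranteed to be sorted. The sketch below verifies the defining relation A v = λ v and sorts the pairs by eigenvalue; the helper names are illustrative.

```python
import numpy as np

A = np.array([[1, 2], [3, 4]])
eigenvalues, eigenvectors = np.linalg.eig(A)

# Check A @ v == lambda * v for each eigenpair (columns are the eigenvectors)
checks = []
for i in range(len(eigenvalues)):
    v = eigenvectors[:, i]
    checks.append(np.allclose(A @ v, eigenvalues[i] * v))
print(checks)

# Sort the pairs from largest to smallest eigenvalue
order = np.argsort(eigenvalues)[::-1]
sorted_values = eigenvalues[order]
sorted_vectors = eigenvectors[:, order]
print(sorted_values)
```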


In [88]:
## Question: Solve the system of equations 3x + y = 9 and x + 2y = 8.
# Answer
# Write the system as A @ x = b and solve for x
A = np.array([[3, 1], [1, 2]])
b = np.array([9, 8])
x = np.linalg.solve(A, b)
print(x) # Output: [2. 3.]

[2. 3.]


In [90]:
# Define the 2D array
arr15 = np.array([[1, 2, 3],
 [4, 5, 6],
 [7, 8, 9]])
print("Array:\n",arr15)
# Define row and column indices
row_indices = np.array([1, 2])
column_indices = np.array([0, 1])
# Fancy indexing pairs the indices element-wise: this selects arr15[1, 0] and arr15[2, 1]
selected_elements = arr15[row_indices, column_indices]
print("\nSelected Elements:\n", selected_elements)


Array:
 [[1 2 3]
 [4 5 6]
 [7 8 9]]

Selected Elements:
 [4 8]
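Paired index arrays pick individual elements, as above. To select every combination of the given rows and columns instead (a sub-grid), `np.ix_` can be used. A sketch contrasting the two, with `grid` standing in for `arr15`:

```python
import numpy as np

grid = np.array([[1, 2, 3],
                 [4, 5, 6],
                 [7, 8, 9]])

rows = np.array([1, 2])
cols = np.array([0, 1])

# Paired: elements (1, 0) and (2, 1)
paired = grid[rows, cols]
print(paired)

# Cross product: rows 1 and 2 crossed with columns 0 and 1
block = grid[np.ix_(rows, cols)]
print(block)
```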


# NumPy Practice Questions

1. **Array Creation**: Create a NumPy array of integers from 1 to 10.

2. **Array Operations**: Perform element-wise addition, subtraction, multiplication, and division on two NumPy arrays.

3. **Array Slicing**: Extract the first three elements from a NumPy array.

4. **Array Reshaping**: Reshape a 1D NumPy array into a 2D array with 2 rows and 3 columns.

5. **Array Aggregation**: Calculate the mean, median, and standard deviation of a NumPy array.


1. **Array Creation**:

In [93]:
array = np.arange(1, 11)  # integers 1 to 10 (the stop value is excluded)
print(array)

[ 1  2  3  4  5  6  7  8  9 10]


2. **Array Operations**:

In [101]:
array_1 = np.arange(1, 11)
array_2 = np.arange(11, 21)

# Element-wise addition

Addition=array_1+array_2

print("Addition:\n", Addition)

# Subtraction

Subtraction=array_1-array_2

print("Subtraction:\n", Subtraction)

# Multiplication:

Multiplication=array_1*array_2

print("Multiplication:\n",Multiplication)

# Division:

Division=array_1/array_2

print("Division:\n",Division)



Addition:
 [12 14 16 18 20 22 24 26 28 30]
Subtraction:
 [-10 -10 -10 -10 -10 -10 -10 -10 -10 -10]
Multiplication:
 [ 11  24  39  56  75  96 119 144 171 200]
Division:
 [0.09090909 0.16666667 0.23076923 0.28571429 0.33333333 0.375
 0.41176471 0.44444444 0.47368421 0.5       ]


3. **Array Slicing**: 

In [105]:
array = np.arange(1, 11)

print(array[:3])  # first three elements

[1 2 3]


4. **Array Reshaping**:

In [107]:
# 2 rows and 3 columns

array = np.arange(1, 7)

array = array.reshape(2, 3)

print("Reshaped array, with 2 rows and 3 columns:\n", array)

Reshaped array, with 2 rows and 3 columns:
 [[1 2 3]
 [4 5 6]]


5. **Array Aggregation**:

In [118]:
from sympy import primerange  # primerange(a, b) yields the primes in [a, b)

# Build an array of the first 100 prime numbers: list the primes below 1000
# (there are more than 100 of them) and keep the first 100.
# Note: the extra pair of brackets makes this a 2-D array of shape (1, 100).
array = np.array([list(primerange(1, 1000))[:100]])
print("Array of first 100 prime numbers:\n",array)
print("\nMean of Array:\n", np.mean(array))
print("\nMedian of Array:\n", np.median(array))
print("\nStandard Deviation of Array:\n", np.std(array))

Array of first 100 prime numbers:
 [[  2   3   5   7  11  13  17  19  23  29  31  37  41  43  47  53  59  61
   67  71  73  79  83  89  97 101 103 107 109 113 127 131 137 139 149 151
  157 163 167 173 179 181 191 193 197 199 211 223 227 229 233 239 241 251
  257 263 269 271 277 281 283 293 307 311 313 317 331 337 347 349 353 359
  367 373 379 383 389 397 401 409 419 421 431 433 439 443 449 457 461 463
  467 479 487 491 499 503 509 521 523 541]]

Mean of Array:
 241.33

Median of Array:
 231.0

Standard Deviation of Array:
 160.02218939884557



# Pandas Practice Questions

1. **DataFrame Creation**: Create a pandas DataFrame from a dictionary of lists, where each list represents a column.

2. **DataFrame Operations**: Add a new column to an existing DataFrame. Perform element-wise operations between two columns.

3. **Data Selection**: Select rows based on a condition. Select specific columns from a DataFrame.

4. **Data Aggregation**: Group the data in a DataFrame based on a categorical column and calculate the mean of each group.

5. **Data Cleaning**: Handle missing values by either removing them or replacing them with appropriate values.

1. **DataFrame Creation**:

In [133]:
import pandas as pd

data={"Student": ["Carlos","Alejandro","Richard","Subi"],
      "Age": [25,26,25,22],
     "GPA":[3.2,3.8,4.0,3.76]}

dataframe=pd.DataFrame(data)

print(dataframe)

     Student  Age   GPA
0     Carlos   25  3.20
1  Alejandro   26  3.80
2    Richard   25  4.00
3       Subi   22  3.76


2. **DataFrame Operations**: 

In [134]:
# Add a new column to an existing Data Frame

dataframe["Country"]=["Mexico", "Mexico", "USA", "USA"]
print(dataframe)

     Student  Age   GPA Country
0     Carlos   25  3.20  Mexico
1  Alejandro   26  3.80  Mexico
2    Richard   25  4.00     USA
3       Subi   22  3.76     USA


In [135]:
# Perform element-wise operations between two columns.

dataframe['Age_Greater_Than_GPA'] = dataframe['Age'] > dataframe['GPA']

dataframe['Age_GPA_Ratio'] = dataframe['Age'] / dataframe['GPA']

print(dataframe)

     Student  Age   GPA Country  Age_Greater_Than_GPA  Age_GPA_Ratio
0     Carlos   25  3.20  Mexico                  True       7.812500
1  Alejandro   26  3.80  Mexico                  True       6.842105
2    Richard   25  4.00     USA                  True       6.250000
3       Subi   22  3.76     USA                  True       5.851064


3. **Data Selection**:

In [136]:
# Select rows based on a condition. Select specific columns from a DataFrame.

# Keep only the rows whose GPA is above the 50th percentile (the median)

dataframe_filtered=dataframe[dataframe["GPA"]>np.percentile(dataframe["GPA"], 50)]

print(dataframe_filtered)

     Student  Age  GPA Country  Age_Greater_Than_GPA  Age_GPA_Ratio
1  Alejandro   26  3.8  Mexico                  True       6.842105
2    Richard   25  4.0     USA                  True       6.250000
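The question also asks for selecting specific columns. A sketch of that step follows, using a freshly built copy of the student data (named `students` here as an assumption) and `.loc` to filter rows and pick columns in one call.

```python
import pandas as pd

students = pd.DataFrame({
    "Student": ["Carlos", "Alejandro", "Richard", "Subi"],
    "Age": [25, 26, 25, 22],
    "GPA": [3.2, 3.8, 4.0, 3.76],
})

# Select specific columns with a list of names
name_and_gpa = students[["Student", "GPA"]]
print(name_and_gpa)

# Filter rows and select columns together with .loc
above_median = students.loc[students["GPA"] > students["GPA"].median(), ["Student", "GPA"]]
print(above_median)
```

`Series.median()` gives the same threshold as `np.percentile(..., 50)` for this data.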


4. **Data Aggregation**: 

In [141]:
# Group the data in a DataFrame based on a categorical column 
# and calculate the mean of each group.

grouped_data = dataframe.groupby("Country")[["Age", "GPA"]].mean()

print("\nMean Age and GPA by Country:\n", grouped_data)


Mean Age and GPA by Country:
           Age   GPA
Country            
Mexico   25.5  3.50
USA      23.5  3.88


5. **Data Cleaning**:

In [142]:
# Data Cleaning and Preparation

# Handling missing values
# (this DataFrame has none, so dropna() and fillna() return it unchanged)
# Drop missing values

dataframe_cleaned = dataframe.dropna()
# Fill missing values

dataframe_filled = dataframe.fillna(0)
# Check for missing values

print(dataframe.isnull().any())

Student                 False
Age                     False
GPA                     False
Country                 False
Age_Greater_Than_GPA    False
Age_GPA_Ratio           False
dtype: bool
