# <div style="text-align: center">Introduction to Python</div>

## <div style="text-align: center">Introduction to Python (II)</div>

In [None]:
%pip install -r requirements.txt

# Libraries

## NumPy

<p>
NumPy is a powerful Python library that is widely used for scientific computing and data analysis. The library provides a multidimensional array object, various derived objects such as masked arrays and matrices, and a large collection of mathematical functions that can be used to perform complex numerical computations. 
</p>
<p>
NumPy arrays hold elements of a single data type in a contiguous block of memory, which makes numerical operations on them faster and more memory-efficient than the same operations on Python's built-in lists, and makes large datasets easier to manipulate. NumPy is widely used in fields such as physics, engineering, and finance, and it is an essential tool for data scientists and machine learning engineers working in Python.
</p>
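
As a minimal sketch of the difference between lists and arrays, the block below doubles the same numbers both ways. The variable names and sample values are hypothetical and chosen only for illustration.

```python
import numpy as np

# Hypothetical sample values, used only for this illustration.
readings = [1.5, 2.0, 3.5, 4.0]

# Plain Python: a list comprehension visits each element one by one.
doubled_list = [r * 2 for r in readings]

# NumPy: the multiplication is applied to the whole array in one expression.
readings_arr = np.array(readings)
doubled_arr = readings_arr * 2

print(doubled_list)
print(doubled_arr)
```

Both approaches produce the same numbers; the array version avoids the explicit Python-level loop, which is where most of the speed difference on large inputs comes from.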

<a href="https://numpy.org/doc/1.24/">NumPy Docs</a>




In [None]:
import numpy as np

### Declaring a NumPy array

In [None]:
np.array([1, 2, 3, 4, 5])

In [None]:
np.array(range(1, 10))

### Max, min and mean values

In [None]:
arr = np.array([1, 2, 3, 4, 5])
print(f'max: {arr.max()}, min: {arr.min()}, mean: {arr.mean()}')

### <span style="color:red">**TASK**</span>

Please find out the sum of the numbers `1,2,3,4,5,6,7,8,9,10` using the appropriate NumPy method below

In [None]:
# Code goes here

#### Arrays of ones and zeros

In [None]:
np.ones(2)

In [None]:
np.ones(2, dtype=np.int64)

In [None]:
np.zeros(3)

#### Arange

In [None]:
np.arange(4)

In [None]:
np.arange(2, 9, 2)

#### Linspace

In [None]:
np.linspace(0, 10, num=5)

#### Concatenating

In [None]:
np.concatenate((np.ones(3), np.zeros(3)))

#### Array size

In [None]:
arr = np.array([[[0, 1, 2, 3],
                 [4, 5, 6, 7]],
                [[0, 1, 2, 3],
                 [4, 5, 6, 7]],
                [[0, 1, 2, 3],
                 [4, 5, 6, 7]]])


In [None]:
print(f'Dimensions: {arr.ndim}\nShape: {arr.shape}\nSize: {arr.size}')

### Reshaping

In [None]:
arr = np.array([1, 2, 3, 4, 5, 6])
arr.reshape(2, 3)

### <span style="color:red">**TASK**</span>

Create a program that reads in an array of integers from the user, reshapes it into a 2D array, and then finds the mean of each row. Finally, the program should output the original 2D array and the mean of each row.

In [None]:
arr = input("Enter an array of integers, separated by spaces: ").split()

# Code goes here

### Indexing

In [None]:
arr = np.arange(1, 10)
print(f'first element: {arr[0]}')
print(f'last element: {arr[-1]}')
print(f'first 3 elements: {arr[:3]}')
print(f'every other element: {arr[::2]}')

In [None]:
a = np.array([[1, 2, 3, 4], [5, 6, 7, 8], [9, 10, 11, 12]])

a[a < 8]

In [None]:
a[a % 2 == 0]

In [None]:
a[(a > 2) & (a < 11)]

In [None]:
(a > 5) | (a == 5)

### Basic operations

In [None]:
data = np.array([4, 6])
ones = np.ones(2, dtype=int)
data + ones

In [None]:
data - ones

In [None]:
np.subtract(data, ones)

In [None]:
data * data

In [None]:
np.multiply(data, data)

In [None]:
data / data

In [None]:
np.divide(data, data)

In [None]:
data * 1.5

In [None]:
a = np.array([[1, 2], [3, 4]])
b = np.array([[5, 6], [7, 8]])

np.dot(a, b)

In [None]:
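np.cross(a, b)  # row-wise cross product of 2-element vectors (deprecated since NumPy 2.0; prefer 3-element vectors)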

### Matrix operations

In [None]:
data = np.array([[1, 2], [3, 4], [5, 6]])

<div style="display: flex;">
<p style="writing-mode: vertical-rl; margin-inline-start: 30px;">AXIS 0 --></p>
<div>
<p>AXIS 1 --></p>

<!-- style="background:red" -->
<table>
  <tr>
    <td>1</td>
    <td>2</td>
  </tr>
  <tr>
    <td>3</td>
    <td>4</td>
  </tr>
  <tr>
    <td>5</td>
    <td>6</td>
  </tr>
</table>
</div>
</div>

In [None]:
data.max(axis=0)

In [None]:
data.max(axis=1)
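
The axis diagram above applies to every reduction, not only `max`. The sketch below uses `sum` on a hypothetical array named `grid` (same values as the example matrix) to show which dimension each `axis` argument collapses.

```python
import numpy as np

# Hypothetical array with the same values as the 3x2 example matrix.
grid = np.array([[1, 2], [3, 4], [5, 6]])  # shape (3, 2)

# axis=0 collapses the rows, leaving one value per column.
col_sums = grid.sum(axis=0)

# axis=1 collapses the columns, leaving one value per row.
row_sums = grid.sum(axis=1)

print(col_sums, col_sums.shape)
print(row_sums, row_sums.shape)
```

A useful rule of thumb: the axis you pass is the one that disappears from the result's shape.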

### <span style="color:red">**TASK**</span>

Write a Python program that performs the following operations on a given matrix:

```
[[1 2 3]
 [4 5 6]]
```

1. Define `matrix` as a 2D NumPy array with the given values
2. Reshape the matrix into a 1D array
3. Calculate the mean of the 1D array
4. Calculate the standard deviation of the 1D array
5. Reshape the 1D array into a 2D matrix with the same number of rows and columns as the original matrix
6. Calculate the element-wise multiplication of the original matrix and the reshaped 2D matrix
7. Calculate the sum of all elements in the element-wise multiplied matrix
8. Output the original matrix, the reshaped 1D array, the mean and standard deviation of the 1D array, the reshaped 2D matrix, the element-wise multiplied matrix, and the sum of all elements in the element-wise multiplied matrix.

In [None]:
# Code goes here

print(f"Original matrix: {matrix}")
print(f"Reshaped 1D array: {arr}")
print(f"Mean of 1D array: {mean}")
print(f"Standard deviation of 1D array: {std}")
print(f"Reshaped 2D matrix: {new_matrix}")
print(f"Element-wise multiplied matrix: {mult_matrix}")
print(f"Sum of all elements in element-wise multiplied matrix: {sum_mult_matrix}")


### Random numbers and arrays

In [None]:
np.random.randint(1, 10, size=(2, 3))

### Comparisons

In [None]:
a = np.array([1, 2, 3, 4, 5])
b = np.array([2, 2, 3, 3, 5])

np.equal(a, b)

## Pandas
---

Pandas is a popular open-source Python library used for data manipulation and analysis. It provides high-level data structures and functions for working with structured data, such as tabular data, time series, and matrices. Pandas is built on top of the NumPy library, and provides additional functionality specifically tailored to handling labeled and relational data.

Pandas provides two primary data structures: Series and DataFrame. A Series is a one-dimensional labeled array capable of holding any data type, while a DataFrame is a two-dimensional table-like structure consisting of rows and columns, similar to a spreadsheet or SQL table.
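
A minimal sketch of the two structures, assuming made-up weather values; the names `temps` and `weather` are hypothetical and not used elsewhere in this notebook.

```python
import pandas as pd

# A Series: one-dimensional values, each paired with an index label.
temps = pd.Series([21.5, 19.0, 23.1], index=["Mon", "Tue", "Wed"], name="temp_c")

# A DataFrame: a table whose columns share one index.
weather = pd.DataFrame(
    {"temp_c": [21.5, 19.0, 23.1], "rainy": [False, True, False]},
    index=["Mon", "Tue", "Wed"],
)

print(temps["Tue"])             # look up a Series value by label
print(type(weather["temp_c"]))  # a single DataFrame column is itself a Series
print(weather.shape)            # (rows, columns)
```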

Pandas provides a wide range of data manipulation and analysis tools, including the ability to read and write data in various formats such as CSV, Excel, SQL databases, and JSON, as well as data cleaning, merging, filtering, and transformation functions. It also provides advanced functionality such as pivot tables, groupby operations, time series analysis, and statistical analysis.
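
To give a feel for a few of these tools together, the sketch below reads a small in-memory CSV, groups it, pivots it, and writes it back out. The data and the names `csv_text`, `sales`, `mean_by_city`, `wide`, and `csv_out` are invented for illustration.

```python
import io

import pandas as pd

# Hypothetical CSV content; io.StringIO lets read_csv treat a string like a file.
csv_text = """city,year,sales
Oslo,2022,10
Oslo,2023,14
Rome,2022,7
Rome,2023,9
"""
sales = pd.read_csv(io.StringIO(csv_text))

# Groupby: average sales per city.
mean_by_city = sales.groupby("city")["sales"].mean()

# Pivot table: one row per city, one column per year.
wide = sales.pivot_table(index="city", columns="year", values="sales")

# Writing: to_csv returns the CSV text when no path is given.
csv_out = sales.to_csv(index=False)

print(mean_by_city)
print(wide)
```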

Pandas has become a widely used tool in data science and machine learning workflows, due to its ability to handle large datasets efficiently, its flexibility for working with different types of data, and its integration with other popular Python libraries such as NumPy, Matplotlib, and Scikit-learn.

<a href="https://pandas.pydata.org/docs/">Pandas Docs</a>

[Dataset for this class](https://raw.githubusercontent.com/guipsamora/pandas_exercises/master/07_Visualization/Tips/tips.csv)

In [None]:
import pandas as pd

### Create a DataFrame

In [None]:
data = {'Name': ['John', 'Alice', 'Bob'], 'Age': [25, 30, 35], 'Gender': ['M', 'F', 'M']}
df = pd.DataFrame(data)

In [None]:
df

### Viewing Data

In [None]:
df.head()

In [None]:
df.tail(1)

### DataFrame data types

In [None]:
df2 = pd.DataFrame(
    {
        "A": 1.0,
        "B": pd.Timestamp("20130102"),
        "C": pd.Series(1, index=list(range(4)), dtype="float32"),
        "D": np.array([3] * 4, dtype="int32"),
        "E": pd.Categorical(["test", "train", "test", "train"]),
        "F": "foo",
    }
)

df2['E']
df2.dtypes

### Date Range and Time Series

In [None]:
dates = pd.date_range("20230318", periods=10)

In [None]:
dates

In [None]:
np_df = pd.DataFrame(np.random.randn(10, 4), index=dates, columns=list("ABCD"))

In [None]:
np_df.index

In [None]:
np_df.columns

In [None]:
np_df.to_numpy()

In [None]:
np_df.describe()

In [None]:
np_df.T

In [None]:
np_df.sort_index(axis=0, ascending=False)

In [None]:
np_df.sort_values(by=["C"])

### Selecting data
#### Access a single column


In [None]:
df['Name']

#### Access multiple columns


In [None]:
age_gender_columns = df[['Age', 'Gender']]
age_gender_columns

#### Access a single row

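`loc` and `iloc` are indexers, used with square brackets rather than called like functions: `loc` selects rows and columns by label, and `iloc` selects them by integer position. `at` and `iat` are their counterparts for reading or writing a single value, and they are faster than `loc` and `iloc` for that case.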
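
A minimal sketch of the four indexers side by side. It uses a hypothetical `people` frame with string row labels so that labels and integer positions are visibly different.

```python
import pandas as pd

# Hypothetical frame; the index labels "a", "b", "c" are chosen for illustration.
people = pd.DataFrame(
    {"Name": ["John", "Alice", "Bob"], "Age": [25, 30, 35]},
    index=["a", "b", "c"],
)

row_by_label = people.loc["b"]          # whole row, selected by label
row_by_position = people.iloc[1]        # the same row, selected by position
value_by_label = people.at["b", "Age"]  # single value, by row/column label
value_by_position = people.iat[1, 1]    # single value, by row/column position

print(row_by_label)
print(value_by_label, value_by_position)
```

With the default integer index used in the cells below, labels and positions happen to coincide, which is why `df.loc[0]` and `df.iloc[0]` return the same row there.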

In [None]:
row_0 = df.loc[0]
row_0

#### Access multiple rows

In [None]:
rows_1_2 = df.loc[1:2]
rows_1_2

#### Access a single value


In [None]:
age_0 = df.loc[0, 'Age']
age_0

In [None]:
ages = df.iloc[:, 1]
ages


In [None]:
df.iloc[0:2, 0:2]

In [None]:
df.at[0, 'Age']

In [None]:
df.iat[0, 1]

#### <span style="color:red">**TASK**</span>

1. Create a 3x3 NumPy array of random values and convert it to a pandas DataFrame with the column names `A`, `B`, and `C`.
2. Access the value at row 1, column `B`.

In [None]:
# Code goes here

#### Boolean Indexing

In [None]:
df[df['Age'] > 25]

In [None]:
df[df['Gender'] == "F"][['Name', 'Age']]

In [None]:
df[df['Name'].isin(['Alice', 'Bob'])]

#### <span style="color:red">**TASK**</span>

1. From the `np_df` dataframe, select the rows where the value of column `B` is greater than 0 and less than 1.

In [None]:
# Code goes here

#### Setting values

In [None]:
df.at[0, "Age"] = 18

In [None]:
df.iat[0, 0] = 'Devin'

In [None]:
df.loc[:, "Salary"] = np.random.randint(10000, 50000, size=(3,))

#### <span style="color:red">**TASK**</span>

1. Use `np_df` dataframe - multiply the values in column `B` by 10 and assign the result to column `B`.
2. Find values in column `B` that are greater than 5 and replace them with 0.

In [None]:
# Code goes here

#### Missing Data

In [None]:
dates = pd.date_range("20230318", periods=10)
missing_data = pd.DataFrame(np.random.randn(10, 4), index=dates, columns=list("ABCD"))
missing_data.loc[dates[0]: dates[4], 'E'] = 1

In [None]:
missing_data.dropna(how="any")

In [None]:
missing_data.fillna(value=3)

In [None]:
pd.isna(missing_data)

#### Reading data from sources

In [None]:
import requests
import io

URL = 'https://raw.githubusercontent.com/guipsamora/pandas_exercises/master/07_Visualization/Tips/tips.csv'
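response = requests.get(URL)
response.raise_for_status()  # fail early if the download did not succeed

# The first CSV column holds the row numbers, so use it as the index.
tips_df = pd.read_csv(io.StringIO(response.text), index_col=0)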

#### Statistical operations

In [None]:
tips_df.mean(numeric_only=True)  # skip text columns such as sex, smoker, day and time

In [None]:
tips_df.describe()

In [None]:
tips_df['tip'].value_counts(dropna=False)

In [None]:
tips_df[['total_bill', 'tip', 'size']].apply(np.cumsum)

In [None]:
tips_df[['total_bill', 'tip', 'size']].corr()

In [None]:
tips_df

#### <span style="color:red">**TASK**</span>

Use `tips_df` dataframe
1. Create new columns `tip_percentage` and `bill_per_person` and calculate the values for each row.

In [None]:
# Code goes here

#### Grouping

In [None]:
import matplotlib.pyplot as plt

In [None]:
mean_tip_by_day = tips_df.groupby('day')['tip'].mean()
mean_tip_by_day.plot(kind='bar')
plt.xlabel('Day of the week')
plt.ylabel('Mean tip')
plt.show()

In [None]:
mean_total_bill_by_day = tips_df.groupby('day')['total_bill'].mean()
mean_total_bill_by_day.plot(kind='bar')
plt.xlabel('Day of the week')
plt.ylabel('Mean total bill')
plt.show()

#### <span style="color:red">**TASK & HOMEWORK**</span>

Use `tips_df` dataframe
1. Find out the average `tip` for each `sex`, `smoker`, `time` and `day`.
2. Group by all four columns together to find the average `tip` for each combination.
3. Display the highest and lowest `tip` for each `day`.
4. Count the number of smokers and non-smokers for each `day`.
5. Find out the average `total_bill` for each `day` and `time` combination.
6. Use the describe function to get the statistics for `total_bill` for each `day` and `time` combination.

In [None]:
# Code goes here