# **Guided Lab - 343.2.1 - NumPy Random module - Random Number Generation**

In [None]:
import numpy as np

# Suppose you have daily temperature data in Celsius (this sample holds 31 readings)
temperature_data = np.array([22.3, 23.1, 24.5, 25.8, 23.6, 26.7, 27.9, 29.2, 30.5, 24.7, 23.4, 22.1, 25.3, 26.4, 28.7, 29.8, 31.2, 32.4, 30.7, 29.5, 27.8, 26.6, 23.9, 22.5, 24.1, 25.7, 27.3, 29.6, 31.0, 33.1, 31.9])


# Calculate the mean, median, and standard deviation
mean_temperature = np.mean(temperature_data)
median_temperature = np.median(temperature_data)
std_deviation_temperature = np.std(temperature_data)


print("Mean Temperature:", mean_temperature)
print("Median Temperature:", median_temperature)
print("Standard Deviation:", std_deviation_temperature)


# Convert temperatures to Fahrenheit
temperature_data_fahrenheit = (temperature_data * 9/5) + 32
print("Fahrenheit temp:\n", temperature_data_fahrenheit)


# Find days with temperatures above a certain threshold (e.g., 30°C)
hot_days = temperature_data[temperature_data > 30]
print("List of hot days above 30 C:\n", hot_days)


# Count the number of hot days
num_hot_days = len(hot_days)
print("Number of Hot Days:", num_hot_days)


# Calculate the total cooling degree days for the recorded period
# Cooling degree days represent the cumulative amount of cooling required to maintain a comfortable indoor temperature.
# In this example, we'll consider a base temperature of 20°C.
base_temperature = 20
cooling_degree_days = np.sum(np.maximum(temperature_data - base_temperature, 0))


print("Cooling Degree Days:", cooling_degree_days)

Mean Temperature: 27.138709677419353
Median Temperature: 26.7
Standard Deviation: 3.221668313492935
Fahrenheit temp:
 [72.14 73.58 76.1  78.44 74.48 80.06 82.22 84.56 86.9  76.46 74.12 71.78
 77.54 79.52 83.66 85.64 88.16 90.32 87.26 85.1  82.04 79.88 75.02 72.5
 75.38 78.26 81.14 85.28 87.8  91.58 89.42]
List of hot days above 30 C:
 [30.5 31.2 32.4 30.7 31.  33.1 31.9]
Number of Hot Days: 7
Cooling Degree Days: 221.3


## **Lab Overview:**

In this lab, we will explore several important random number generation functions provided by the NumPy library in Python. These functions include `np.random.choice()` for generating random samples from arrays, `np.random.shuffle()` for shuffling the contents of arrays in place, and `np.random.randn()` for generating random numbers from a standard normal distribution. Through hands-on exercises, participants will gain an understanding of how to use these functions effectively in various scenarios.

## **Lab Objective:**

By the end of this lab, participants will:

- Describe the purpose and usage of key random number generation functions provided by NumPy, including `np.random.choice()`, `np.random.shuffle()`, and `np.random.randn()`.
- Demonstrate how to generate random samples from arrays, shuffle array contents, and generate random numbers from a standard normal distribution.


## **Introduction**
The syntax of `np.random.randn()` typically involves passing the desired dimensions of the array as arguments. For example:

- `np.random.randn()` generates a single random number.
- `np.random.randn(n)` generates an array of n random numbers.
- `np.random.randn(m, n)` generates a 2D array with m rows and n columns of random numbers.

Note: These random numbers are drawn from a standard normal distribution, where the mean is 0 and the standard deviation is 1.
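
As a rough check of the note above, the following minimal sketch draws a large sample and compares its mean and standard deviation with 0 and 1. The seed value and the sample size are arbitrary choices made for this illustration; they are not part of the lab.

```python
import numpy as np

# Seed the global generator so the sketch is repeatable (the seed value is arbitrary).
np.random.seed(0)

# Draw a large sample from the standard normal distribution.
samples = np.random.randn(100_000)

sample_mean = samples.mean()
sample_std = samples.std()

print("Sample mean:", round(sample_mean, 3))  # should be close to 0
print("Sample std: ", round(sample_std, 3))   # should be close to 1
```

With a smaller sample, the mean and standard deviation will typically drift further from 0 and 1, which is why a large sample is used here.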






## **Example 1: Generates a single random number.**

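In this example, we will generate random numbers with the `rand()` and `randint()` functions. `rand()` returns floats drawn uniformly from the interval [0, 1), and `randint()` returns random integers below a given upper bound. Called without a size, each function returns a single number.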

In [None]:
import numpy as np
from numpy import random

# 10 random floats drawn uniformly from the interval [0, 1)
data = random.rand(10)
print("Random floats sampled from a uniform distribution over [0, 1):", data)
print()

# 7 random integers from 0 to 4 (the upper bound 5 is excluded); without size, randint returns a single integer
data2 = random.randint(5, size=(7))
print("Random integers from 0 to 4:\n", data2)

# size can also be a tuple that specifies the shape of the array (3 rows x 4 columns)
data3 = random.randint(5, size=(3, 4))
print("Random integers from 0 to 4 (3 x 4 array):\n", data3)

Random floats sampled from a uniform distribution over [0, 1): [0.10108953 0.24906625 0.79648364 0.58939269 0.99779708 0.57535609
 0.58763224 0.29679178 0.42867092 0.92937507]

Random integers from 0 to 4:
 [1 2 0 0 3 1 2]
Random integers from 0 to 4 (3 x 4 array):
 [[1 1 2 1]
 [0 4 0 0]
 [4 1 0 2]]
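
Since the heading of this example refers to a single random number, here is a minimal sketch of how each function behaves when no size is given. The variable names are chosen only for this illustration.

```python
from numpy import random

single_uniform = random.rand()      # one float drawn uniformly from [0, 1)
single_normal = random.randn()      # one float from the standard normal distribution
single_integer = random.randint(5)  # one integer from 0 to 4

print(single_uniform, single_normal, single_integer)
```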


## **Example 2: Generates an array of n random numbers.**
In this example, we generate an array of n random numbers by using the `np.random.randn(n)` function. The function returns an array of n random numbers drawn from a distribution with mean 0 and standard deviation 1.


In [None]:
n = 5
random_numbers = random.randn(n)
print("Array of", n, "random numbers sampled from a standard normal distribution:")
print(random_numbers)


Array of 5 random numbers sampled from a standard normal distribution:
[ 1.74597212 -0.04976857 -0.81894081 -0.71959882  0.36540445]


## **Example 3: Generates a 2D array with m rows and n columns of random numbers.**

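In this example, we will generate 2D arrays of random numbers with `np.random.randn(m, n)`: first a 3x2 array (3 rows, 2 columns), then a 4x2 array. The values are drawn from a standard normal distribution, so they can be negative or positive and are not limited to a fixed range.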

In [None]:
m = 3
n = 2
random_numbers_2d = random.randn(m, n)
print("2D array of random numbers sampled from a standard normal distribution (", m, "rows x", n, "columns):")
print(random_numbers_2d)

m1 = 4
n1 = 2
random_numbers2_2d = random.randn(m1, n1)
print("2D array of random numbers sampled from a standard normal distribution (", m1, "rows x", n1, "columns):\n", random_numbers2_2d)

2D array of random numbers sampled from a standard normal distribution ( 3 rows x 2 columns):
[[-0.76477598  0.19917108]
 [-0.99037421  0.98610926]
 [ 0.25248891  0.12892551]]
2D array of random numbers sampled from a standard normal distribution ( 4 rows x 2 columns):
 [[0.29538017 0.86670919]
 [0.21590136 0.02381745]
 [0.58687493 1.24131465]
 [0.06618132 0.78091248]]


## **Example 4: Generates a random sample from a given 1-D array.**
In this example, we will generate a random sample from a given 1-D array. We will use the numpy.random.choice() function to do this.

In [None]:
from numpy import random

# Define an array of elements
elements = ['a', 'b', 'c', 'd', 'e']

# Generate a random sample from the array
random_sample = random.choice(elements)
print("Randomly sampled element:", random_sample)

# p assigns a probability to each element: 0 means the value never occurs and 1 means it always occurs; the probabilities must sum to 1.
# Here 'b', 'd', and 'e' have probabilities 0.5, 0.4, and 0.1; use size to specify the shape of the output array
random_sample_2 = random.choice(elements, p=[0.0, 0.5, 0.0, 0.4, 0.1], size=(10))
print("Randomly sampled element:", random_sample_2)


Randomly sampled element: c
Randomly sampled element: ['b' 'b' 'b' 'b' 'd' 'd' 'd' 'd' 'd' 'd']
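
To see how the `p` argument shapes the result, the sketch below draws many samples and compares the observed frequencies with the requested probabilities. It also shows `replace=False`, which samples without replacement. The seed and sample sizes are arbitrary assumptions for this illustration.

```python
import numpy as np

np.random.seed(1)  # arbitrary seed so the sketch is repeatable

elements = ['a', 'b', 'c', 'd', 'e']
probabilities = [0.0, 0.5, 0.0, 0.4, 0.1]

# Draw many samples so the observed frequencies approach the requested probabilities.
draws = np.random.choice(elements, size=10_000, p=probabilities)
values, counts = np.unique(draws, return_counts=True)
frequencies = {str(v): int(c) / draws.size for v, c in zip(values, counts)}
print(frequencies)  # 'a' and 'c' never appear because their probability is 0

# replace=False samples without replacement, so no element can repeat.
unique_sample = np.random.choice(elements, size=3, replace=False)
print(unique_sample)
```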


## **Example 5: Shuffles the contents of an array in place.**
In this example, we will shuffle the contents of an array. We will use the np.random.shuffle() function.

In [None]:
# Define an array of elements
elements = ['a', 'b', 'c', 'd', 'e']

# Shuffle the array in place
random.shuffle(elements)
print("Shuffled array:", elements)


Shuffled array: ['e', 'c', 'a', 'd', 'b']
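
Because `shuffle()` works in place, it returns `None` and overwrites the original order. When the original order must be kept, `np.random.permutation()` returns a shuffled copy instead. The following is a minimal sketch of that difference, using a hypothetical array and an arbitrary seed.

```python
import numpy as np

np.random.seed(2)  # arbitrary seed so the sketch is repeatable

original = np.array([1, 2, 3, 4, 5])

# permutation() returns a shuffled copy and leaves its input untouched.
shuffled_copy = np.random.permutation(original)

# shuffle() reorders the array itself and returns None.
in_place = original.copy()
result = np.random.shuffle(in_place)

print("Original:     ", original)
print("Shuffled copy:", shuffled_copy)
print("In place:     ", in_place)
print("Return value: ", result)
```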


## **Example 6: Real-world example - Restaurant Menu Randomizer**

Suppose you're designing a digital menu for a restaurant, and you want to create a feature that suggests a random dish to the user when they're undecided about what to order. You can use NumPy's random module to implement this feature.

In [None]:
# Define a list of dishes on the menu
menu = [
    "Spaghetti Carbonara",
    "Chicken Alfredo",
    "Margherita Pizza",
    "Cheeseburger",
    "Caesar Salad",
    "Fish and Chips",
    "Pad Thai",
    "Sushi Platter",
    "Vegetable Stir-Fry",
    "Grilled Salmon"
]

# Function to suggest a random dish
def suggest_dish():
    random_dish = random.choice(menu)
    return random_dish

# Main program
print("Welcome to the Restaurant Menu Randomizer!\n")
print("If you're undecided about what to order, let us help you decide.\n")

while True:
    user_input = input("Press enter to get a random dish suggestion (or type 'quit' to exit): \n")

    if user_input.lower() == 'quit':
        print("Thank you for using the Restaurant Menu Randomizer. Enjoy your meal!")
        break

    suggested_dish = suggest_dish()
    print("Randomly suggested dish:", suggested_dish, "\n")


Welcome to the Restaurant Menu Randomizer!

If you're undecided about what to order, let us help you decide.



Randomly suggested dish: Sushi Platter 

Randomly suggested dish: Pad Thai 

Randomly suggested dish: Grilled Salmon 

Randomly suggested dish: Margherita Pizza 


Thank you for using the Restaurant Menu Randomizer. Enjoy your meal!


***

# **Guided Lab - 343.2.2 - NumPy and Mathematical Calculation**

## **Objectives**
By the end of this lab, learners will be able to:
- Describe the basics of NumPy.
- Manipulate data using NumPy arrays.
- Perform element-wise mathematical operations on arrays.
- Use NumPy functions for statistical and mathematical calculations.
- Slice and index NumPy arrays for data extraction.
- Apply common data science tasks using NumPy.

## **Introduction**
NumPy (Numerical Python) is a fundamental library for numerical computing in Python. It provides support for multidimensional arrays, mathematical functions, and a wide range of operations on arrays. In this guided lab, you will dive into the world of NumPy and learn how to efficiently manipulate arrays, perform mathematical operations, and solve common data science tasks using this powerful library.

## **Instructions**

**Example 1: Import NumPy**

Open a Python environment (e.g., Jupyter Notebook).
Import the NumPy library using **`import numpy as np`**.

In [None]:
import numpy as np

**Example 2: Creating NumPy Arrays**


*   Create a 1D NumPy array from a Python list.
*   Create a 2D NumPy array from a nested Python list.
*   Explore the attributes of NumPy arrays such as shape, size, and data type.






In [None]:
# Create a 1D NumPy array from a Python list
python_list_1d = [1, 2, 3, 4, 5]
numpy_array_1d = np.array(python_list_1d)

# Create a 2D NumPy array from a nested Python list
python_list_2d = [[1, 2, 3], [4, 5, 6], [7, 8, 9]]
numpy_array_2d = np.array(python_list_2d)

# Explore the attributes of NumPy arrays
shape_1d = numpy_array_1d.shape
shape_2d = numpy_array_2d.shape

size_1d = numpy_array_1d.size
size_2d = numpy_array_2d.size

data_type_1d = numpy_array_1d.dtype
data_type_2d = numpy_array_2d.dtype

print("1D NumPy Array:")
print("Array:", numpy_array_1d)
print("Shape:", shape_1d)
print("Size:", size_1d)
print("Data Type:", data_type_1d)

print("\n2D NumPy Array:")
print("Array:")
print(numpy_array_2d)
print("Shape:", shape_2d)
print("Size:", size_2d)
print("Data Type:", data_type_2d)


1D NumPy Array:
Array: [1 2 3 4 5]
Shape: (5,)
Size: 5
Data Type: int64

2D NumPy Array:
Array:
[[1 2 3]
 [4 5 6]
 [7 8 9]]
Shape: (3, 3)
Size: 9
Data Type: int64


**Example 3: Element-wise Operations and Mathematical Functions**
- Perform element-wise addition, subtraction, multiplication, and division on arrays.
- Apply basic mathematical functions (e.g., np.sin(), np.exp()) to arrays.


In [None]:
# Perform element-wise operations on arrays
addition_result = numpy_array_1d + 10  # Adds 10 to each element/value in the array
subtraction_result = numpy_array_1d - 2  # Subtracts 2 from each element/value in the array
multiplication_result = numpy_array_1d * 3  # Multiplies each element/value in the array by 3
division_result = numpy_array_1d / 2  # Divides each element/value in the array by 2

print("\nElement-wise Operations on 1D Array:")
print("Original Array:", numpy_array_1d)
print("Addition Result:", addition_result)
print("Subtraction Result:", subtraction_result)
print("Multiplication Result:", multiplication_result)
print("Division Result:", division_result)

# Apply basic mathematical functions to arrays
sin_result = np.sin(numpy_array_1d)  # Calculate the sine of each element (values are treated as angles in radians)
exp_result = np.exp(numpy_array_1d)  # Calculate the exponential of all elements in the input array

print("\nMathematical Functions Applied to 1D Array:")
print("Original Array:", numpy_array_1d)
print("sin Result:", sin_result)
print("exp Result:", exp_result)


Element-wise Operations on 1D Array:
Original Array: [1 2 3 4 5]
Addition Result: [11 12 13 14 15]
Subtraction Result: [-1  0  1  2  3]
Multiplication Result: [ 3  6  9 12 15]
Division Result: [0.5 1.  1.5 2.  2.5]

Mathematical Functions Applied to 1D Array:
Original Array: [1 2 3 4 5]
sin Result: [ 0.84147098  0.90929743  0.14112001 -0.7568025  -0.95892427]
exp Result: [  2.71828183   7.3890561   20.08553692  54.59815003 148.4131591 ]


**Example 4: Statistical and Mathematical Functions**
- Calculate the mean, median, and standard deviation of an array.
- Find the minimum and maximum values in an array.

In [None]:
# Apply basic mathematical functions to arrays
sin_result = np.sin(numpy_array_1d)
exp_result = np.exp(numpy_array_1d)

print("\nMathematical Functions Applied to 1D Array:")
print("Original Array:", numpy_array_1d)
print("sin Result:", sin_result)
print("exp Result:", exp_result)

# Calculate statistics on the 1D array
mean_value = np.mean(numpy_array_1d)
median_value = np.median(numpy_array_1d)
std_deviation = np.std(numpy_array_1d)
min_value = np.min(numpy_array_1d)
max_value = np.max(numpy_array_1d)

print("\nStatistics on 1D Array:")
print("Mean:", mean_value)
print("Median:", median_value)
print("Standard Deviation:", std_deviation)
print("Minimum Value:", min_value)
print("Maximum Value:", max_value)


Mathematical Functions Applied to 1D Array:
Original Array: [1 2 3 4 5]
sin Result: [ 0.84147098  0.90929743  0.14112001 -0.7568025  -0.95892427]
exp Result: [  2.71828183   7.3890561   20.08553692  54.59815003 148.4131591 ]

Statistics on 1D Array:
Mean: 3.0
Median: 3.0
Standard Deviation: 1.4142135623730951
Minimum Value: 1
Maximum Value: 5


**Example 5: Slicing and Indexing**
- Slice arrays to extract specific elements or sub-arrays.
- Use boolean indexing to filter elements based on conditions.
- Combine slicing and indexing techniques to manipulate arrays effectively.


In [None]:
# Slicing and Indexing
sliced_array = numpy_array_1d[1:4]  # Slice elements from index 1 to 3
index_condition = numpy_array_1d > 3  # Create a boolean index condition

print("\nSliced Array:")
print(sliced_array)

print("\nBoolean Indexing:")
print("Original Array:", numpy_array_1d)
print("Index Condition:", index_condition)
print("Filtered Elements:", numpy_array_1d[index_condition])

# Combining Slicing and Indexing
combined_result = numpy_array_1d[1:4][numpy_array_1d[1:4] > 3]

print("\nCombined Slicing and Indexing:")
print("Original Array:", numpy_array_1d)
print("Filtered Elements:", combined_result)



Sliced Array:
[2 3 4]

Boolean Indexing:
Original Array: [1 2 3 4 5]
Index Condition: [False False False  True  True]
Filtered Elements: [4 5]

Combined Slicing and Indexing:
Original Array: [1 2 3 4 5]
Filtered Elements: [4]


**Example 6:  Real-world example of using NumPy**

Let's consider a real-world example of using NumPy for data manipulation in the context of analyzing temperature data from a weather station.

You have a dataset with daily temperature records for a year, and you want to perform various operations and calculations on this data.

In the following example, we will use NumPy to:
- Calculate the mean, median, and standard deviation of the daily temperature data.

- Convert the temperatures from Celsius to Fahrenheit.
- Identify hot days with temperatures above 30°C.
- Count the number of hot days.
- Calculate cooling degree days, a measure of cooling requirements based on a specified base temperature.

These operations demonstrate how NumPy can simplify data manipulation and analysis, making it a powerful tool for working with real-world data in various scientific and engineering fields, including meteorology and climate science.
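
The cooling degree days calculation listed above can be sketched on a tiny sample. The three temperatures below are hypothetical values chosen so the arithmetic is easy to follow; the base temperature of 20°C matches the one used in the lab's code.

```python
import numpy as np

# Hypothetical three-day sample in Celsius (not the lab's dataset)
sample_temperatures = np.array([18.0, 22.0, 25.0])
base_temperature = 20

# Degrees above the base on each day; days at or below the base contribute 0.
daily_excess = np.maximum(sample_temperatures - base_temperature, 0)
cooling_degree_days = daily_excess.sum()

print(daily_excess)         # [0. 2. 5.]
print(cooling_degree_days)  # 7.0
```

`np.maximum` compares each element with 0, so the 18°C day does not subtract from the total.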


***

# PA 343.2.1 NumPy
## Task 1: Data Type Objects
1. Create a NumPy array using the np.array() function and specify the data type as 'int32'. Print the array and its data type.
1. Create another NumPy array using np.array() with the data type as 'float64'. Print the array and its data type.

In [None]:
import numpy as np

In [None]:
arr = np.array([1, 2, 3, 4, 5, 6, 7], dtype='int32')
print(arr)
print(arr.dtype)

[1 2 3 4 5 6 7]
int32
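
The second part of the task (the `float64` array) has no cell above. One possible sketch, reusing the same values, is shown below; the variable name `arr_float` is chosen only for this illustration.

```python
import numpy as np

# Same construction as above, but requesting 64-bit floating point storage
arr_float = np.array([1, 2, 3, 4, 5, 6, 7], dtype='float64')
print(arr_float)        # [1. 2. 3. 4. 5. 6. 7.]
print(arr_float.dtype)  # float64
```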


## Task 2: One-Dimensional and Multi-Dimensional Arrays
1. Create a one-dimensional NumPy array containing the integers from 1 to 5. Print the array.
1. Create a two-dimensional NumPy array (matrix) with the following values:

In [None]:
arr_1d = np.array([1, 2, 3, 4, 5])
print(arr_1d)

arr_2d = np.array([[6, 7, 8, 9], [10, 11, 12, 13]])
print(arr_2d)

[1 2 3 4 5]
[[ 6  7  8  9]
 [10 11 12 13]]


## Task 3: NumPy Operations
1. Create two NumPy arrays arr1 and arr2 with any values of your choice. Perform element-wise addition, subtraction, multiplication, and division between these arrays and print the results.
1. Calculate the dot product between arr1 and arr2. Print the result.
1. Use NumPy to calculate the mean, median, and standard deviation of arr1.

In [None]:
arr1 = np.array([2, 4, 6, 8])
arr2 = np.array([10, 12, 14, 16])

add_arr1 = arr1 + 2
sub_arr1 = arr1 - 1
mult_arr1 = arr1 * 10
div_arr1 = arr1 / 2
dot_arr1 = np.dot(arr1, arr2)
mean_arr1 = np.mean(arr1)
median_arr1 = np.median(arr1)
std_arr1 = np.std(arr1)

print("Addition: \n", add_arr1, "\nSubtraction:\n", sub_arr1, "\nMultiplication:\n", mult_arr1, "\nDivision:\n", div_arr1)
print("\nDot product:\n", dot_arr1)
print("\nMean:\n", mean_arr1, "\nMedian:\n", median_arr1, "\nStandard Deviation:\n", std_arr1)

add_arr2 = arr2 + 2
sub_arr2 = arr2 - 1
mult_arr2 = arr2 * 10
div_arr2 = arr2 / 2
dot_arr2 = np.dot(arr1, arr2)
mean_arr2 = np.mean(arr2)
median_arr2 = np.median(arr2)
std_arr2 = np.std(arr2)

print("\nAddition: \n", add_arr2, "\nSubtraction:\n", sub_arr2, "\nMultiplication:\n", mult_arr2, "\nDivision:\n", div_arr2)
print("\nDot product:\n", dot_arr2)
print("\nMean:\n", mean_arr2, "\nMedian:\n", median_arr2, "\nStandard Deviation:\n", std_arr2)

Addition: 
 [ 4  6  8 10] 
Subtraction:
 [1 3 5 7] 
Multiplication:
 [20 40 60 80] 
Division:
 [1. 2. 3. 4.]

Dot product:
 280

Mean:
 5.0 
Median:
 5.0 
Standard Deviation:
 2.23606797749979

Addition: 
 [12 14 16 18] 
Subtraction:
 [ 9 11 13 15] 
Multiplication:
 [100 120 140 160] 
Division:
 [5. 6. 7. 8.]

Dot product:
 280

Mean:
 13.0 
Median:
 13.0 
Standard Deviation:
 2.23606797749979
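
The cells above combine each array with a scalar. The task also asks for element-wise operations between `arr1` and `arr2`, which could look like the following sketch using the same two arrays.

```python
import numpy as np

arr1 = np.array([2, 4, 6, 8])
arr2 = np.array([10, 12, 14, 16])

# Element-wise operations pair up values at the same position in both arrays.
print("Addition:      ", arr1 + arr2)  # [12 16 20 24]
print("Subtraction:   ", arr1 - arr2)  # [-8 -8 -8 -8]
print("Multiplication:", arr1 * arr2)  # [ 20  48  84 128]
print("Division:      ", arr1 / arr2)

# The dot product is the sum of the element-wise products.
print("Dot product:   ", np.dot(arr1, arr2))  # 280
```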


***

# **Guided Lab - 343.3.1 - Pandas Series**



**Lab Objective and Overview:**

- In this lab, we will demonstrate the Pandas Series data structure. The lab guides you through creating and manipulating Series using various methods. The Series is a fundamental component in data analysis with Pandas, offering a one-dimensional labeled array capable of holding various data types.

- Each example provides a clear demonstration of a specific concept.

**Key Topics**

- **Series Creation:** Learn how to create Series using dictionaries, scalar values, and NumPy arrays.

- **Indexing:** Understand how to select specific elements in a Series using labels.

- **Handling Missing Data:** Work with NaN (Not a Number) values that represent missing data.

**Learning Objective:**

By the end of this lab, you will be able to:
- Describe the concept of a Pandas Series.
- Create a Series from a dictionary, scalar value, and ndarray.
- Access and manipulate elements within a Series.
- Use the Pandas Series data structure.

**Introduction:**

A Series is a one-dimensional labeled array capable of holding any data type (integers, strings, floating point numbers, Python objects, etc.). The axis labels are collectively referred to as the index. The basic method to create a Series is to call:

### **`newSeries = pd.Series(data, index)`**

Here, data can be many different things:

- a Python dict.

- an ndarray.
- a scalar value (like 5).

The passed index is a list of axis labels. Thus, this separates into a few cases depending on what data is:

---
## **Begin:**


## **Example 1: Creating Pandas Series from Dictionary**

Series can be instantiated from dicts:

### **Example: 1.1**




In [None]:
import pandas as pd

In [None]:
# Keys of the dictionary become the labels
d = {"b": 1, "a": 0, "c": 2, "d":100}
print(pd.Series(d))
print()
# Use index to select specific items in the dictionary
e = pd.Series(d, index = ['a', 'c'])
print(e)

b      1
a      0
c      2
d    100
dtype: int64

a    0
c    2
dtype: int64


### **Example: 1.2**




In [None]:
d = {"a": 0.0, "b": 1.0, "c": 2.0}
pd.Series(d)

a    0.0
b    1.0
c    2.0
dtype: float64

### **Example: 1.3**

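If an index is passed, the values in data corresponding to the labels in the index will be pulled out. A label with no matching key in data (here, "d", because d was redefined in Example 1.2) is filled with NaN.


In [None]:
pd.Series(d, index=["b", "c", "d", "a"])

b    1.0
c    2.0
d    NaN
a    0.0
dtype: float64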

**Note:** NaN (not a number) is the standard missing data marker used in pandas.
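
As a minimal sketch of working with that marker, the snippet below builds a Series with one missing label and applies three common pandas methods to it. The fill value of 0.0 is an arbitrary choice for this illustration.

```python
import pandas as pd

d = {"a": 0.0, "b": 1.0, "c": 2.0}
s = pd.Series(d, index=["b", "c", "d", "a"])  # "d" has no value, so it becomes NaN

print(s.isna())       # True only for the missing label "d"
print(s.dropna())     # drops the missing entry
print(s.fillna(0.0))  # replaces NaN with a chosen value
```

None of these methods changes `s` itself; each returns a new Series.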

## **Example 2: Key/Value Objects as Series**

**Example 2.1**

Note: The keys of the dictionary become the labels.


In [None]:
calories = {"day1": 420, "day2": 380, "day3": 390}

myvar = pd.Series(calories)

print(myvar)

day1    420
day2    380
day3    390
dtype: int64


**Example 2.2**

To select only some of the items in the dictionary, use the index argument and specify only the items you want to include in the Series.


Create a Series using only data from "day1" and "day2":


In [None]:
calories = {"day1": 420, "day2": 380, "day3": 390}
myvar = pd.Series(calories, index = ["day1", "day2"])
myvar2 = pd.Series(calories, index = ["day3"])
print(myvar)
print(myvar2)

day1    420
day2    380
dtype: int64
day3    390
dtype: int64


**Example 3: Creating Pandas Series from Scalar Value**

If data is a scalar value, an index must be provided. The value will be repeated to match the length of the index.




In [None]:
pd.Series(5.0, index=["a", "b", "c", "d", "e"])

a    5.0
b    5.0
c    5.0
d    5.0
e    5.0
dtype: float64
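
The lab objectives also mention creating a Series from an ndarray and selecting elements by label, which the examples above do not show. The following is a minimal sketch under the assumption of a small hypothetical array; the values and labels are illustrative only.

```python
import numpy as np
import pandas as pd

values = np.array([10.5, 20.5, 30.5])  # hypothetical data

# Without an index, pandas assigns integer labels 0, 1, 2, ...
s_default = pd.Series(values)

# With an index, its length must match the length of the array.
s_labeled = pd.Series(values, index=["a", "b", "c"])

print(s_default)
print(s_labeled["b"])             # select one element by label
print(s_labeled.loc[["a", "c"]])  # select several elements by label
```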

***

# **Guided Lab - 343.3.2 - Creating a Pandas DataFrame**


**Lab Objective:**

In this lab, we will demonstrate how to create a Pandas DataFrame, a fundamental data structure in data analysis with Python.

**Importance:** Mastering DataFrame creation is crucial for data manipulation, analysis, and visualization in Python. It's the foundation for working with data in Pandas.

**Learning Objective:**

By the end of this lab, you will be able to create DataFrames using various methods, including dictionaries, lists, and NumPy arrays.

**Instructions:**

You can start by importing pandas along with NumPy, which you will use throughout the following examples:
```
import numpy as np
import pandas as pd
```

That’s it. Now you’re ready to create some DataFrames.


**Example 1: Creating a Pandas DataFrame from Dictionaries**

We can create a Pandas DataFrame with a Python dictionary:


In [None]:
import numpy as np
import pandas as pd

In [None]:
# Create Python dictionary
d = {'x': [1, 2, 3], 'y': [2, 4, 8], 'z': 100}

# Create DataFrame from a dictionary
df_d = pd.DataFrame(d)
print(df_d)

   x  y    z
0  1  2  100
1  2  4  100
2  3  8  100


The keys of the dictionary are the DataFrame’s column labels, and the dictionary values are the data values in the corresponding DataFrame columns.

The values can be contained in a tuple, list, one-dimensional NumPy array, Pandas Series object, or one of several other data types. You can also provide a single value that will be copied along the entire column.

It’s possible to control the order of the columns with the columns parameter and row labels with index as shown in the below example:





In [None]:
df_d2 = pd.DataFrame(d, index=[100, 200, 300], columns=['z', 'y', 'x'])
print(df_d2)

       z  y  x
100  100  2  1
200  100  4  2
300  100  8  3


**Example 2.1: Creating a Pandas DataFrame from lists using zip() function**

We can also use the **`zip()`** function to zip together multiple lists to create a DataFrame with more columns.


In [None]:
# create a list of patientID, name, and date of birth and assign it to a variable
patientID = [101,23,48,49]
name =       ['alice','bob','charlie','Eric']
# create a list of dates
date_of_birth = ['2023-01-01', '2023-01-02', '3/10/2020 143045', '13th of October, 2023']
# Create a DataFrame using zip and pd.DataFrame
myDF = pd.DataFrame(zip(patientID, name,date_of_birth), columns=['Patient ID', 'Name', 'DoB'])
myDF



   Patient ID     Name                    DoB
0         101    alice             2023-01-01
1          23      bob             2023-01-02
2          48  charlie       3/10/2020 143045
3          49     Eric  13th of October, 2023


**Explanation:**
- `zip(patientID, name, date_of_birth)`: The zip() function combines the elements from the three lists into tuples. Each tuple represents a row of data, associating a patient ID, name, and date of birth.
- `pd.DataFrame(...)`: This creates a Pandas DataFrame using the output of zip() as the data source.
- `columns=['Patient ID', 'Name', 'DoB']`: This argument sets the column names for the DataFrame.
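
To make the first point concrete, the short sketch below materializes the output of `zip()` so the row tuples are visible. It reuses the three lists from the example; the name `rows` is introduced only for this illustration.

```python
patientID = [101, 23, 48, 49]
name = ['alice', 'bob', 'charlie', 'Eric']
date_of_birth = ['2023-01-01', '2023-01-02', '3/10/2020 143045', '13th of October, 2023']

# zip() is lazy, so wrap it in list() to see the tuples it produces.
rows = list(zip(patientID, name, date_of_birth))

print(rows[0])    # (101, 'alice', '2023-01-01')
print(len(rows))  # 4
```

Each tuple becomes one row of the DataFrame, which is why the three lists should have the same length; `zip()` stops at the shortest one.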

**Example 2.2: Creating a Pandas DataFrame from a List of Dictionaries**

- Another way to create a Pandas DataFrame is to use a **list** of **dictionaries**.
- To do so, create a list in which each dictionary describes one row (keys are column labels, values are the cell values), then pass the list to the `pd.DataFrame()` constructor. Optionally, pass a list of strings to the `columns` parameter to choose which keys become columns and in what order.




In [None]:

l = [{'x': 1, 'y': 2, 'z': 100},
     {'x': 2, 'y': 4, 'z': 100},
     {'x': 3, 'y': 8, 'z': 100}]

pd.DataFrame(l)

   x  y    z
0  1  2  100
1  2  4  100
2  3  8  100


Again, the dictionary keys are the column labels, and the dictionary values are the data values in the DataFrame.

You can also use a **nested list,** or a **list of lists**, as the data values. If you do, then it is wise to explicitly specify the labels of columns, rows, or both when you create the DataFrame.


In [None]:
l = [[1, 2, 100],
     [2, 4, 100],
     [3, 8, 100]]

pd.DataFrame(l, columns=['x', 'y', 'z'])



   x  y    z
0  1  2  100
1  2  4  100
2  3  8  100


That is how you can use a nested list to create a Pandas DataFrame. You can also use a list of tuples in the same way. To do so, just replace the nested lists in the example above with tuples.
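
The tuple variant mentioned above can be sketched as follows. The variable names are chosen for this illustration; the data is the same as in the nested list example.

```python
import pandas as pd

rows_as_lists = [[1, 2, 100], [2, 4, 100], [3, 8, 100]]
rows_as_tuples = [(1, 2, 100), (2, 4, 100), (3, 8, 100)]

df_from_lists = pd.DataFrame(rows_as_lists, columns=['x', 'y', 'z'])
df_from_tuples = pd.DataFrame(rows_as_tuples, columns=['x', 'y', 'z'])

# Both inputs describe the same rows, so the resulting DataFrames match.
print(df_from_tuples)
print(df_from_lists.equals(df_from_tuples))  # True
```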


**Example 2.3: Creating a Pandas DataFrame using Lists**

In [None]:
stocks = ["IBM", "APPLE", "TWTTR", "GE", "MSFT"]
prices = [115.00, 119.14, 19.77, 25.99, 26]

print(pd.DataFrame(zip(stocks, prices), columns=['stocks', 'prices']))

# Using Tuples
stocks1 = ("IBM", "APPLE", "TWTTR", "GE", "MSFT")
prices1 = (115.00, 119.14, 19.77, 25.99, 26)
print("Tuples:")
print(pd.DataFrame(zip(stocks1, prices1), columns=['stocks', 'prices']))


  stocks  prices
0    IBM  115.00
1  APPLE  119.14
2  TWTTR   19.77
3     GE   25.99
4   MSFT   26.00
Tuples:
  stocks  prices
0    IBM  115.00
1  APPLE  119.14
2  TWTTR   19.77
3     GE   25.99
4   MSFT   26.00


**Example 3: Creating a pandas DataFrame from NumPy Arrays**

You can pass a two-dimensional NumPy array to the DataFrame constructor the same way you do with a list:

In [None]:
# The following line creates a NumPy array named arr.
arr = np.array([[1, 2, 100],[2, 4, 100],[3, 8, 100]])
# The following line creates a Pandas DataFrame named df that holds its own copy of the data (copy=True)
df = pd.DataFrame(arr, columns=['x', 'y', 'z'], copy=True)
df


   x  y    z
0  1  2  100
1  2  4  100
2  3  8  100


Although this example looks almost the same as the nested list implementation above, it has one advantage. You can specify the optional parameter copy.

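When ***copy*** is ***False***, the data from the NumPy array is not copied: the DataFrame refers to the array's original data, so modifying the array also changes the DataFrame. For a two-dimensional NumPy array this has traditionally been the default, although the exact behavior can depend on the pandas version. The example above passes ***copy=True***, so df holds its own copy of the values and stays unchanged when arr is modified: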




In [None]:
arr[0, 0] = 1000
print(df, "\n")

print(arr)


   x  y    z
0  1  2  100
1  2  4  100
2  3  8  100 

[[1000    2  100]
 [   2    4  100]
 [   3    8  100]]


Note: Not copying data values can save you a significant amount of time and processing power when working with large datasets.


If this behavior is not what you want, you should specify copy=True in the DataFrame constructor. That way, df will be created with a copy of the values from arr instead of the actual values.


***

# **Guided LAB - 343.3.3 - Reading HTML tables with Pandas**

## **Learning Objective:**

- In this lab, we will demonstrate how to use the pandas `read_html()` function to read HTML tables from a Wikipedia web page.

- By the end of this lab, learners will be able to:
  - Use the `read_html()` function to extract tables from HTML.
  - Select specific tables using the match parameter.
  - Convert HTML tables into pandas DataFrames for further analysis.

## **Introduction:**

You can use the **read_html()** function to scrape HTML tables directly from a website and convert them into DataFrames.

The **read_html()** function takes a string that contains a URL or a file path leading to an HTML file, extracts all the tables contained within that HTML page, and returns a list of DataFrames.

## **Example: 1**
For the first example, we will parse a table from the Politics section of the Minnesota Wikipedia page.

In [None]:
import pandas as pd
import numpy as np

In [None]:
html_string = 'https://en.wikipedia.org/wiki/Minnesota'
table_MN = pd.read_html(html_string)

In [None]:
print(table_MN)

In [None]:
# Access third table on page, Location column
print(table_MN[2]['Location'])

0            Minneapolis
1             Saint Paul
2              Rochester
3                 Duluth
4              St. Cloud
5                Mankato
6    International Falls
Name: Location, dtype: object


In [None]:
print(f'Total tables: {len(table_MN)}')

Total tables: 30


- There are 30 tables, so it can be challenging to find the one you need. To make table selection easier, use the **match** parameter to select only the tables that contain matching text.

- We need only the table that contains the text **“United States presidential election results for Minnesota”**.


In [None]:
table_MN = pd.read_html('https://en.wikipedia.org/wiki/Minnesota', match='United States presidential election results for Minnesota')
print(f'Total tables: {len(table_MN)}')
table_MN

Total tables: 1


[    Year Republican         Democratic         Third party(ies)        
     Year        No.       %        No.       %              No.       %
 0   2024    1519032  46.68%    1656979  50.92%            77909   2.39%
 1   2020    1484065  45.28%    1717077  52.40%            76029   2.32%
 2   2016    1323232  44.93%    1367825  46.44%           254176   8.63%
 3   2012    1320225  44.96%    1546167  52.65%            70169   2.39%
 4   2008    1275409  43.82%    1573354  54.06%            61606   2.12%
 5   2004    1346695  47.61%    1445014  51.09%            36678   1.30%
 6   2000    1109659  45.50%    1168266  47.91%           160760   6.59%
 7   1996     766476  34.96%    1120438  51.10%           305726  13.94%
 8   1992     747841  31.85%    1020997  43.48%           579110  24.66%
 9   1988     962337  45.90%    1109471  52.91%            24982   1.19%
 10  1984    1032603  49.54%    1036364  49.72%            15482   0.74%
 11  1980     873241  42.56%     954174  46.50%    

***

# **<font color='#0969DA'>Guided Lab 343.3.4 - Exploratory Data Analysis on json data - Basic insights from the Data</font>**
---

## **Lab Overview:**

This lab focuses on performing Exploratory Data Analysis (EDA) on a JSON dataset using Python and the Pandas library. The lab aims to guide you through the following key concepts:

1. **Data Type Inspection:** Understanding the importance of checking data types for potential mismatches and compatibility with Python methods. This is demonstrated using the `dtypes` attribute of Pandas DataFrames.

2. **Descriptive Statistics:** Calculating and interpreting basic statistical measures such as mean, standard deviation, minimum, maximum, and quartiles using the `describe()` method.

3. **Concise Summary:** Obtaining a comprehensive overview of the dataset, including column names, data types, memory usage, and non-null values, using the `info()` method.

4. **Data Selection:** Extracting specific records or subsets of the data using the `head()` and `tail()` methods and the `at` and `iat` accessors, enabling efficient exploration of large datasets.

5. **Data Shape and Size:** Determining the number of rows and columns using the `shape` attribute and exploring alternative methods like `axes` and `len` to access this information.

**Learning Outcomes:**

By the end of this lab, you should be able to:

* Confidently load and manipulate JSON data in Python using Pandas.
* Utilize various Pandas functions to perform basic EDA tasks.
* Interpret descriptive statistics and summaries to gain insights into data.
* Efficiently extract and analyze specific subsets of data.
* Understand the structure and dimensions of a dataset.

**Dataset:**

The lab utilizes a JSON dataset named ['cars.json'](https://drive.google.com/file/d/1CXAK8gbuLtc2NNOXVUgmja8fDg0TrNZm/view) as the primary data source for analysis and demonstration.


### **<font color='#0969DA'>How to check Data types in Pandas</font>**



- In pandas, we use the **dtypes** attribute to check the data type of each column.
- Why check data types?
 - To spot type mismatches, such as numbers that were loaded as text.
 - To confirm compatibility with the Python methods you plan to use.
---
# **Begin**

The lab follows a step-by-step approach, starting with loading the JSON data into a Pandas DataFrame. It then proceeds with exploring the data's characteristics, calculating statistics, selecting specific records, and understanding the dataset's structure.

In [None]:
import pandas as pd

In [None]:
# Read JSON file
df_cars = pd.read_json('./Data/cars.json')
print(df_cars.dtypes) # check the underlying data types

Car              object
MPG             float64
Cylinders         int64
Displacement    float64
Horsepower        int64
Weight            int64
Acceleration    float64
Model             int64
Origin           object
quantity          int64
city             object
dtype: object
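
To illustrate why a type mismatch matters, here is a minimal sketch. The `df_demo` frame and its `price` column are hypothetical and not part of cars.json; the sketch assumes one entry cannot be parsed as a number, which forces the whole column to a text dtype.

```python
import pandas as pd

# Hypothetical data: 'price' holds text because one entry is not numeric
df_demo = pd.DataFrame({"item": ["a", "b", "c"], "price": ["10.5", "7", "n/a"]})
print(df_demo.dtypes)  # 'price' is reported with a text dtype (object in pandas 2.x)

# Convert to numbers; entries that cannot be parsed become NaN
df_demo["price"] = pd.to_numeric(df_demo["price"], errors="coerce")
print(df_demo.dtypes)  # 'price' is now float64

# Numeric methods work after the conversion (NaN is skipped by default)
print(df_demo["price"].mean())
```

Checking `dtypes` first makes this kind of problem visible before a numeric method fails or silently gives a wrong answer.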


## **<font color='#0969DA'>Determining Descriptive Statistics</font>**

- Pandas provides many statistical methods for DataFrames. You can get a basic statistical summary of the numerical columns of a Pandas DataFrame with the **describe()** method.

See the pandas documentation for all descriptive statistics methods:
https://pandas.pydata.org/pandas-docs/stable/reference/frame.html#computations-descriptive-stats

- Example: Consider the **cars.json** dataset

In [None]:

df_cars = pd.read_json('./Data/cars.json')
df_cars.describe()

Unnamed: 0,MPG,Cylinders,Displacement,Horsepower,Weight,Acceleration,Model,quantity
count,161.0,161.0,161.0,161.0,161.0,161.0,161.0,161.0
mean,23.801863,5.347826,185.232919,100.664596,2915.093168,15.509938,76.26087,224.875776
std,8.810125,1.761607,105.394809,41.07964,890.293883,2.51578,3.818576,127.741084
min,0.0,3.0,68.0,0.0,1613.0,8.0,70.0,5.0
25%,17.0,4.0,98.0,72.0,2130.0,14.0,73.0,112.0
50%,24.0,4.0,140.0,88.0,2625.0,15.5,76.0,227.0
75%,31.0,8.0,302.0,130.0,3620.0,17.1,80.0,337.0
max,46.6,8.0,440.0,215.0,4955.0,22.1,82.0,439.0


In the above result, describe() returns a new DataFrame whose rows hold the count of non-null values, the mean, standard deviation, minimum, maximum, and quartiles of each numeric column.
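
The following sketch shows a few common variations of describe() on a small hypothetical frame (`df_demo` is made up for illustration and is not the cars dataset). It assumes you sometimes want non-numeric columns or custom percentiles in the summary.

```python
import pandas as pd

# Hypothetical frame with one text column and one numeric column
df_demo = pd.DataFrame({
    "model": ["A", "B", "C", "D"],
    "mpg": [20.0, 30.0, 25.0, 35.0],
})

# Default: numeric columns only
numeric_summary = df_demo.describe()
print(numeric_summary)

# include="all" adds non-numeric columns (rows such as unique, top, freq)
full_summary = df_demo.describe(include="all")
print(full_summary)

# Custom percentiles replace the default 25% and 75% (the median is always kept)
tails_summary = df_demo.describe(percentiles=[0.1, 0.9])
print(tails_summary)

# The result is itself a DataFrame, so single statistics can be looked up by label
print(numeric_summary.loc["mean", "mpg"])
```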

---



## **<font color='#0969DA'>Determine Basic Concise summary</font>**

You can get a concise summary of a Pandas DataFrame with the **info()** method.

In other words, info() reports the metadata of a pandas DataFrame, which includes:

- Number of rows and its range of index
- Total number of columns
- List of columns
- Count of the total number of non-null values in the column
- Data type of column
- Count of columns in each data type
- Memory usage by the DataFrame

Reference: https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.DataFrame.info.html

# **<font color='#0969DA'>DataFrame Count</font>**

df.count():
DataFrame count returns the number of non-NA values within each column. It is not the most convenient way to get a row count because 1) it's slower and 2) it returns one number per column, so you need to do extra work after you call .count().

Be careful: if you have NAs in your dataset, the result can be confusing, because count() skips them by default.
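
A minimal sketch of how count() differs from the row count when values are missing. The `df_demo` frame is hypothetical and only serves to make the difference visible.

```python
import numpy as np
import pandas as pd

# Hypothetical frame with missing values in both columns
df_demo = pd.DataFrame({"name": ["Ann", "Bob", None], "age": [30, np.nan, np.nan]})

per_column = df_demo.count()  # non-NA values per column
n_rows = len(df_demo)         # total number of rows, NAs included

print(per_column)
print("Rows:", n_rows)
```

Here each column reports fewer values than there are rows, which is why count() alone is not a reliable row count.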

In [None]:
df_cars.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 161 entries, 0 to 160
Data columns (total 11 columns):
 #   Column        Non-Null Count  Dtype  
---  ------        --------------  -----  
 0   Car           161 non-null    object 
 1   MPG           161 non-null    float64
 2   Cylinders     161 non-null    int64  
 3   Displacement  161 non-null    float64
 4   Horsepower    161 non-null    int64  
 5   Weight        161 non-null    int64  
 6   Acceleration  161 non-null    float64
 7   Model         161 non-null    int64  
 8   Origin        161 non-null    object 
 9   quantity      161 non-null    int64  
 10  city          161 non-null    object 
dtypes: float64(3), int64(5), object(3)
memory usage: 14.0+ KB


In the above result, the information includes the number of columns, column labels, column data types, memory usage, the range index, and the number of non-null values in each column.

---



## **<font color='#0969DA'>Select few records</font>**

The **head()** and **tail()** functions are used to select the top and bottom rows of a Pandas DataFrame, respectively. They are useful when a dataset is too large to view all at once.

**Example: Consider the cars.json dataset**

With **head(2)**, only the first 2 rows of the DataFrame are displayed. A negative argument such as **head(-4)** returns all rows except the last 4.

In [None]:
print(df_cars.head(2))
print(df_cars.head(-4)) # All rows minus the last 4

                   Car   MPG  Cylinders  Displacement  Horsepower  Weight  \
0       Chevrolet Vega  25.0          4         140.0          75    2542   
1  Chevrolet Vega (sw)  22.0          4         140.0          72    2408   

   Acceleration  Model Origin  quantity    city  
0          17.0     74     US       177      NJ  
1          19.0     71     US        91  DALLAS  
                            Car   MPG  Cylinders  Displacement  Horsepower  \
0                Chevrolet Vega  25.0          4         140.0          75   
1           Chevrolet Vega (sw)  22.0          4         140.0          72   
2           Chevrolet Vega 2300  28.0          4         140.0          90   
3               Chevrolet Woody  24.5          4          98.0          60   
4    Chevrolete Chevelle Malibu  16.0          6         250.0         105   
..                          ...   ...        ...           ...         ...   
152          Mercedes Benz 300d  25.4          5         183.0          



---



With **tail(2)**, only the last 2 rows of the DataFrame are displayed. A negative argument such as **tail(-20)** returns all rows except the first 20.

In [None]:
print(df_cars.tail(2))
print(df_cars.tail(-20)) # All rows except the first 20 (starts at row 20)

                 Car   MPG  Cylinders  Displacement  Horsepower  Weight  \
159   Mercury Lynx l  36.0          4          98.0          70    2125   
160  Mercury Marquis  11.0          8         429.0         208    4633   

     Acceleration  Model Origin  quantity   city  
159          17.3     82     US       425  TEXAS  
160          11.0     72     US       112     OH  
                         Car   MPG  Cylinders  Displacement  Horsepower  \
20             Datsun 280-ZX  32.7          6         168.0         132   
21                Datsun 310  37.2          4          86.0          65   
22             Datsun 310 GX  38.0          4          91.0          67   
23                Datsun 510  27.2          4         119.0          97   
24           Datsun 510 (sw)  28.0          4          97.0          92   
..                       ...   ...        ...           ...         ...   
156         Mercury Capri v6  21.0          6         155.0         107   
157  Mercury Cougar B



---
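
The behaviour of head() and tail() with positive and negative arguments can be sketched on a small hypothetical frame (`df_demo`, ten rows numbered 0 to 9), where the selected rows are easy to verify by eye.

```python
import pandas as pd

# Hypothetical frame with index 0..9
df_demo = pd.DataFrame({"n": range(10)})

first_two = df_demo.head(2)             # rows 0 and 1
all_but_last_four = df_demo.head(-4)    # rows 0 through 5
last_two = df_demo.tail(2)              # rows 8 and 9
all_but_first_three = df_demo.tail(-3)  # rows 3 through 9

print(first_two.index.tolist())
print(all_but_last_four.index.tolist())
print(last_two.index.tolist())
print(all_but_first_three.index.tolist())
```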



## **<font color='#0969DA'>Select Specific records</font>**

Use the **at** and **iat** properties to access a specific element in the DataFrame. **at** selects by row and column labels, while **iat** selects by integer positions.

Example: Using **at** property:

**Consider the cars.json dataset**



In [None]:
print(df_cars.at[157, 'MPG'])
print(df_cars.at[20, 'MPG'])


15.0
32.7


**DataFrame.iat:** Suppose we want to access a specific element of a very large DataFrame, but we know only its row and column positions, not its labels. We can still access the element by position with the iat property of pandas.

**Example: Using iat property:**
In this example, we access the element at row position 157 and column position 1. Positions start at 0, so column position 1 is the second column (MPG).

In [None]:
df_cars.iat[157, 1]

np.float64(15.0)
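
With the default RangeIndex, row labels and row positions coincide, so the difference between at and iat is easy to miss. The sketch below uses a hypothetical frame (`df_demo`) with string row labels to make the distinction explicit.

```python
import pandas as pd

# Hypothetical frame whose row labels are strings, not positions
df_demo = pd.DataFrame(
    {"mpg": [25.0, 22.0, 28.0], "cyl": [4, 4, 6]},
    index=["vega", "vega_sw", "vega_2300"],
)

by_label = df_demo.at["vega_sw", "mpg"]  # row label, column label
by_position = df_demo.iat[1, 0]          # row position, column position
print(by_label, by_position)

# Both accessors can also be used to set a single value
df_demo.at["vega_sw", "mpg"] = 23.5
print(df_demo.iat[1, 0])
```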



---



# **<font color='#0969DA'>DataFrame Shape</font>**
## **Find number of rows and columns**
The number of rows and columns of a DataFrame can be found with the **.shape** attribute of the pandas DataFrame. It returns a tuple (rows, columns), which can be indexed to get only the row count or only the column count.


**- df.shape[0] - To count rows**

**- df.shape[1] - To count columns**

In [None]:

print(df_cars.shape) # Get the number of rows and columns
print(df_cars.shape[0]) # Get the number of rows only
print(df_cars.shape[1]) # Get the number of columns only

(161, 11)
161
11


In [None]:
# Create DataFrame from dict
student_dict = {'Name': ['Joe', 'Nat', 'Harry'], 'Age': [20, 21, 19], 'Marks': [85.10, 77.80, 91.54]}

student_df = pd.DataFrame(student_dict)

columns_index = student_df.columns    # get the column Index
print(columns_index)
first_label = student_df.columns[0]  # 1st column label
print(first_label)
columns_as_list = student_df.columns.tolist() # get the column labels as a list
print(columns_as_list)

Index(['Name', 'Age', 'Marks'], dtype='object')
Name
['Name', 'Age', 'Marks']




---

# **<font color='#0969DA'>DataFrame Axes Length</font>**

**len(df.axes[0]):** Next up is our most verbose option – DataFrame Axes Length.

This axes attribute will return your row axis, then you must count the length of it.

Let’s break this one down. **df.axes** will return a tuple of your two axes for rows and columns. [0] will pull the first item (rows) from the tuple. Then finally **len()** will find the length, or how many items, you have in your axis which is your row count.

 Let's look through it step by step.

- Return both axes (rows/columns)

- Pull out the rows

- Count the length

In [None]:
df_cars.axes

[RangeIndex(start=0, stop=161, step=1),
 Index(['Car', 'MPG', 'Cylinders', 'Displacement', 'Horsepower', 'Weight',
        'Acceleration', 'Model', 'Origin', 'quantity', 'city'],
       dtype='object')]

In [None]:
df_cars.axes[0]

RangeIndex(start=0, stop=161, step=1)

In [None]:
len(df_cars.axes[0])

161
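
For comparison, here is a short sketch of several equivalent ways to get a row count, using a hypothetical three-row frame (`df_demo`). All of them should agree on the same number.

```python
import pandas as pd

# Hypothetical frame with 3 rows and 2 columns
df_demo = pd.DataFrame({"a": [1, 2, 3], "b": [4, 5, 6]})

n_shape = df_demo.shape[0]     # first element of the (rows, columns) tuple
n_len = len(df_demo)           # len() of a DataFrame is its row count
n_axes = len(df_demo.axes[0])  # length of the row axis
n_index = len(df_demo.index)   # df.index is the same object as df.axes[0]

print(n_shape, n_len, n_axes, n_index)
```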

***

# **Guided LAB 343.3.5 - Exploratory Data Analysis on CSV data - Basic insights from the Data**
---


## **Lab Overview**

This lab focuses on introducing fundamental data analysis techniques using Python's Pandas library. We'll primarily work with a CSV file containing employee data, exploring various methods to load, manipulate, and gain insights from it.

In this lab, we will demonstrate with examples how to read a CSV file with or without a header, skip rows, skip columns, set columns to index, and more. We will also perform Exploratory Data Analysis (EDA) on a CSV file.

**Key Activities:**

1. **Data Loading and Initial Exploration:**  We'll begin by importing the Pandas library and utilizing the `read_csv()` function to load the employee dataset into a Pandas DataFrame. We'll then use methods like `head()`, `tail()`, and `info()` to get an initial overview of the data's structure and content.
2. **Data Handling Techniques:** We will delve into practical data handling strategies, such as skipping rows or selecting specific columns during the data loading process using parameters like `skiprows` and `usecols` with `read_csv()`.
3. **Basic Exploratory Data Analysis (EDA):** We'll perform basic EDA to understand the dataset's characteristics. This includes examining data types, identifying potential missing values, and assessing the overall shape and size of the DataFrame.

**Learning Outcomes:**

By the end of this lab, you will be proficient in:

* Importing and utilizing the Pandas library in Google Colab.
* Loading CSV data into a Pandas DataFrame.
* Applying data manipulation techniques to extract desired information.
* Performing basic EDA to gain insights from datasets.

## **Introduction:**
Use the pandas read_csv() function to read a CSV (comma-separated values) file into a pandas DataFrame. The function also supports options for reading any delimited file.

## **Dataset:**
In this lab we will utilize the dummy employee dataset.

[Click here to download employee dataset (employee.csv)](https://drive.google.com/file/d/14RV1xKIRzWS166LtGqnPC1Wg7eTlI_y1/view?usp=drive_link)

---

# **Begin**



**Example 1: Reading Data from CSVs**

Note: if you get a parsing error caused by malformed lines, try the line below, which skips the bad lines:
`df = pd.read_csv('employee.csv', on_bad_lines='skip')`



In [None]:
import pandas as pd

In [None]:
# Check to make sure the absolute/relative file path is correct
df = pd.read_csv('./Data/employee.csv')
df


Unnamed: 0,Name,Age,Weight,Salary
0,James,36.0,75.0,5428000.0
1,Villers,38.0,74.0,3428000.0
2,VKole,31.0,70.0,8428000.0
3,Smith,34.0,80.0,4428000.0
4,Gayle,40.0,100.0,4528000.0
5,Adam,40.0,,4528000.0
6,Rooter,33.0,72.0,7028000.0
7,Peterson,42.0,85.0,2528000.0
8,lynda,42.0,85.0,
9,,42.0,85.0,


Note: Use the sep or delimiter argument to specify the separator of the columns. By default, it uses a comma.
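
As a minimal sketch of the `sep` argument, the example below reads semicolon-separated text from an in-memory string instead of a file. The `raw` content is hypothetical and is not the lab's employee.csv.

```python
import io
import pandas as pd

# Hypothetical semicolon-separated content, held in memory for the demo
raw = "Name;Age;Salary\nJames;36;5428000\nSmith;34;4428000\n"

# Without sep=";" every line would be read as a single column
df_demo = pd.read_csv(io.StringIO(raw), sep=";")
print(df_demo)
```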


**Example 2: Viewing or Exploring your Data**


The first thing to do when opening a new dataset is to print out a few rows to keep as a visual reference. We accomplish this with .head():





In [None]:
df.head() # Show first 5 rows

Unnamed: 0,Name,Age,Weight,Salary
0,James,36.0,75.0,5428000.0
1,Villers,38.0,74.0,3428000.0
2,VKole,31.0,70.0,8428000.0
3,Smith,34.0,80.0,4428000.0
4,Gayle,40.0,100.0,4528000.0


.head() outputs the first five rows of your DataFrame by default, but we can also pass a number. df.head(10) would output the top ten rows.


In [None]:

df.head(10)

Unnamed: 0,Name,Age,Weight,Salary
0,James,36.0,75.0,5428000.0
1,Villers,38.0,74.0,3428000.0
2,VKole,31.0,70.0,8428000.0
3,Smith,34.0,80.0,4428000.0
4,Gayle,40.0,100.0,4528000.0
5,Adam,40.0,,4528000.0
6,Rooter,33.0,72.0,7028000.0
7,Peterson,42.0,85.0,2528000.0
8,lynda,42.0,85.0,
9,,42.0,85.0,



To see the last five rows, use df.tail(). It also accepts a number; df.tail(2) prints the bottom two rows, as in this case.



In [None]:
df.tail(2)

Unnamed: 0,Name,Age,Weight,Salary
13,John,41.0,85.0,1528000.0
14,Ali,26.0,69.0,


**Example 3: Getting Information About your Data**

.info() should be one of the very first commands you run after loading your data:




In [None]:

df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 15 entries, 0 to 14
Data columns (total 4 columns):
 #   Column  Non-Null Count  Dtype  
---  ------  --------------  -----  
 0   Name    14 non-null     object 
 1   Age     12 non-null     float64
 2   Weight  14 non-null     float64
 3   Salary  12 non-null     float64
dtypes: float64(3), object(1)
memory usage: 612.0+ bytes


**.info()** provides the essential details about your dataset such as the number of rows and columns, the number of non-null values, what type of data is in each column, and how much memory your DataFrame is using.

Another fast and useful attribute is .shape, which returns just a tuple of (rows, columns):









In [None]:
df.shape


(15, 4)

Note that .shape has no parentheses and is a simple tuple format (rows, columns). So, we have 15 rows and 4 columns in our employee DataFrame.

**Example 4: Skip Rows**

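Sometimes, you may need to skip the first rows or the footer rows of a file. To do this, use the skiprows and skipfooter parameters, respectively. Note that skipfooter is unsupported by the default C engine, so pass engine='python' together with it.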


   


In [None]:

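# Skip the first 5 lines of the file (the header line plus 4 data rows); no header is read
df_skip5 = pd.read_csv('./Data/employee.csv', header=None, skiprows=5)

# Skip the first 5 lines, then treat the next line as the header and replace it with custom names,
# so the first 5 data rows are dropped
df_skip5_exclude = pd.read_csv('./Data/employee.csv', header=0, names=['name', 'age', 'weight', 'salary'], skiprows=5)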
print(df_skip5)
print()
print(df_skip5_exclude)

           0     1      2          3
0      Gayle  40.0  100.0  4528000.0
1       Adam  40.0    NaN  4528000.0
2     Rooter  33.0   72.0  7028000.0
3   Peterson  42.0   85.0  2528000.0
4      lynda  42.0   85.0        NaN
5        NaN  42.0   85.0        NaN
6      Jenny   NaN  100.0    25632.0
7       Kenn   NaN  110.0    25632.0
8        Aly   NaN   90.0    25582.0
9       John  41.0   85.0  1528000.0
10       Ali  26.0   69.0        NaN

       name   age  weight     salary
0      Adam  40.0     NaN  4528000.0
1    Rooter  33.0    72.0  7028000.0
2  Peterson  42.0    85.0  2528000.0
3     lynda  42.0    85.0        NaN
4       NaN  42.0    85.0        NaN
5     Jenny   NaN   100.0    25632.0
6      Kenn   NaN   110.0    25632.0
7       Aly   NaN    90.0    25582.0
8      John  41.0    85.0  1528000.0
9       Ali  26.0    69.0        NaN
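
If you want to keep the original header while skipping data rows, one option is to pass a range of line numbers to skiprows. The sketch below assumes a small hypothetical CSV held in memory (`raw`); line 0 is the header, so `range(1, 3)` skips only the first two data rows. It also shows skipfooter with the Python engine.

```python
import io
import pandas as pd

# Hypothetical CSV content; line 0 is the header
raw = "Name,Age\nJames,36\nVillers,38\nVKole,31\nSmith,34\n"

# Skip file lines 1 and 2 (the first two data rows) but keep the header
df_keep_header = pd.read_csv(io.StringIO(raw), skiprows=range(1, 3))
print(df_keep_header)

# Skip the last line of the file; skipfooter needs the Python engine
df_no_footer = pd.read_csv(io.StringIO(raw), skipfooter=1, engine="python")
print(df_no_footer)
```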


**Example 5: Load Only Selected Columns**

Use the usecols argument of read_csv() to load only selected columns. There are two common ways to use this argument:

**Method 1:** Use usecols with Column Names

Syntax:
`df = pd.read_csv('my_data.csv', usecols=['column name one', 'column name two'])`

**Method 2:** Use usecols with Column Positions

Syntax:
`df = pd.read_csv('my_data.csv', usecols=[0, 2])`






In [None]:
pd.read_csv('./Data/employee.csv', usecols=['Name', 'Salary'])

Unnamed: 0,Name,Salary
0,James,5428000.0
1,Villers,3428000.0
2,VKole,8428000.0
3,Smith,4428000.0
4,Gayle,4528000.0
5,Adam,4528000.0
6,Rooter,7028000.0
7,Peterson,2528000.0
8,lynda,
9,,


In [None]:
pd.read_csv('./Data/employee.csv', usecols=[0, 3])


Unnamed: 0,Name,Salary
0,James,5428000.0
1,Villers,3428000.0
2,VKole,8428000.0
3,Smith,4428000.0
4,Gayle,4528000.0
5,Adam,4528000.0
6,Rooter,7028000.0
7,Peterson,2528000.0
8,lynda,
9,,
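
Besides names and positions, usecols also accepts a function that is called with each column name and returns True for the columns to keep. The sketch below uses a hypothetical in-memory CSV (`raw`) and also shows that the order of names in usecols does not change the column order of the result.

```python
import io
import pandas as pd

# Hypothetical CSV content with the same column names as the employee dataset
raw = "Name,Age,Weight,Salary\nJames,36,75,5428000\nSmith,34,80,4428000\n"

# Keep every column except 'Weight'
df_no_weight = pd.read_csv(io.StringIO(raw), usecols=lambda name: name != "Weight")
print(df_no_weight.columns.tolist())

# The list order is ignored: columns come back in file order
df_two = pd.read_csv(io.StringIO(raw), usecols=["Salary", "Name"])
print(df_two.columns.tolist())

# To reorder, index the result with the desired order
df_reordered = df_two[["Salary", "Name"]]
print(df_reordered.columns.tolist())
```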


***

# **Guided Lab 343.3.6 - Selecting Columns in Pandas DataFrames**

---



## **Learning Objective:**
This lab focuses on accessing and selecting specific columns from Pandas DataFrames, a fundamental skill in data manipulation and analysis.

By the end of this lab, learners will be able to select any specified column or columns from a Pandas DataFrame.

**Introduction:**

Pandas DataFrames provide flexible ways to select columns, including:

1. **Column Attribute:** Use square brackets `[]` with the column name for single column selection or a list of column names for multiple columns.
2. **Column Index Number:** Access columns using their index number within the DataFrame.

**Lab Activities:**

1. **Import Dataset:** Begin by importing the `cars.json` dataset into a Pandas DataFrame named `df_cars`.
2. **Select Single Column:** Extract the 'Car' column to demonstrate single column selection.
3. **Select Multiple Columns:** Select 'Car', 'Model', and 'quantity' columns to illustrate multiple column selection.
4. **Select by Column Index:** Practice selecting columns using their index number.
5. **Continue Learning:** This lab serves as an introduction. Further exploration of column selection techniques will be covered later in the module.



## **Introduction:**
**Selecting columns by using column attribute**

- To select a single column, use square brackets [ ] with the column name of the column of interest.

- To select multiple columns, use a list of column names within the selection brackets [ ].

**Syntax:**

```
# select column to Series
s = df['colName']

# select column to dataframe
df = df[['colName']]

# select two or more column
df = df[['ColOne','colTwo']]  

# select column by column index number
s = df[df.columns[0]]  

# select columns by column index numbers
df = df[df.columns[[0, 3, 4]]]
```





---

## **Dataset:**

The lab utilizes the [`cars.json`](https://drive.google.com/file/d/1CXAK8gbuLtc2NNOXVUgmja8fDg0TrNZm/view) dataset, providing a practical context for applying column selection methods.

Let's import the cars dataset into a pandas DataFrame.

In [None]:
import pandas as pd

In [None]:

df_cars = pd.read_json('./Data/cars.json')
df_cars


Unnamed: 0,Car,MPG,Cylinders,Displacement,Horsepower,Weight,Acceleration,Model,Origin,quantity,city
0,Chevrolet Vega,25.0,4,140.0,75,2542,17.0,74,US,177,NJ
1,Chevrolet Vega (sw),22.0,4,140.0,72,2408,19.0,71,US,91,DALLAS
2,Chevrolet Vega 2300,28.0,4,140.0,90,2264,15.5,71,US,74,TEXAS
3,Chevrolet Woody,24.5,4,98.0,60,2164,22.1,76,US,241,OH
4,Chevrolete Chevelle Malibu,16.0,6,250.0,105,3897,18.5,75,US,206,NewYork
...,...,...,...,...,...,...,...,...,...,...,...
156,Mercury Capri v6,21.0,6,155.0,107,2472,14.0,73,US,158,NewYork
157,Mercury Cougar Brougham,15.0,8,302.0,130,4295,14.9,77,US,27,NJ
158,Mercury Grand Marquis,16.5,8,351.0,138,3955,13.2,79,US,332,DALLAS
159,Mercury Lynx l,36.0,4,98.0,70,2125,17.3,82,US,425,TEXAS


## **Example - Select One Column:**

Suppose we are interested in the names of the cars.

In [None]:
df_cars['Car']

0                  Chevrolet Vega
1             Chevrolet Vega (sw)
2             Chevrolet Vega 2300
3                 Chevrolet Woody
4      Chevrolete Chevelle Malibu
                  ...            
156              Mercury Capri v6
157       Mercury Cougar Brougham
158         Mercury Grand Marquis
159                Mercury Lynx l
160               Mercury Marquis
Name: Car, Length: 161, dtype: object

In [None]:
df_cars['Car'].head(10)

0                Chevrolet Vega
1           Chevrolet Vega (sw)
2           Chevrolet Vega 2300
3               Chevrolet Woody
4    Chevrolete Chevelle Malibu
5                     Chevy C20
6                    Chevy S-10
7              Chrysler Cordoba
8    Chrysler Lebaron Medallion
9        Chrysler Lebaron Salon
Name: Car, dtype: object
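
Single brackets and double brackets return different types, which the following sketch makes explicit. The `df_demo` frame is a hypothetical two-row stand-in for the cars dataset.

```python
import pandas as pd

# Hypothetical two-row stand-in for the cars data
df_demo = pd.DataFrame({"Car": ["Vega", "Woody"], "MPG": [25.0, 24.5]})

as_series = df_demo["Car"]   # single brackets: one-dimensional Series
as_frame = df_demo[["Car"]]  # double brackets: DataFrame with one column

print(type(as_series), as_series.shape)
print(type(as_frame), as_frame.shape)
```

The distinction matters when a later step expects a DataFrame (for example, a method that needs column labels) rather than a Series.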



---



## **Example: Select multiple Columns**

Suppose we are interested in the car names, the models of the cars, and the quantity.

In [None]:
print(df_cars[['Car','Model', 'quantity']].head(4))
print()
print(df_cars[['Horsepower','Weight']].head(4))


                   Car  Model  quantity
0       Chevrolet Vega     74       177
1  Chevrolet Vega (sw)     71        91
2  Chevrolet Vega 2300     71        74
3      Chevrolet Woody     76       241

   Horsepower  Weight
0          75    2542
1          72    2408
2          90    2264
3          60    2164


## **Example: Select single column by column index number**

In [None]:
print("Column 1:")
print(df_cars[df_cars.columns[0]].head(5), '\n')

print("Column 3:")
print(df_cars[df_cars.columns[2]].head(5), '\n')


Column 1:
0                Chevrolet Vega
1           Chevrolet Vega (sw)
2           Chevrolet Vega 2300
3               Chevrolet Woody
4    Chevrolete Chevelle Malibu
Name: Car, dtype: object 

Column 3:
0    4
1    4
2    4
3    4
4    6
Name: Cylinders, dtype: int64 



## **Example: Select Multiple columns by column index number**

In [None]:
print(df_cars[df_cars.columns[[0,1,9]]].head(2))
print()
print(df_cars[df_cars.columns[[7, 3, 5]]].head(2))

                   Car   MPG  quantity
0       Chevrolet Vega  25.0       177
1  Chevrolet Vega (sw)  22.0        91

   Model  Displacement  Weight
0     74         140.0    2542
1     71         140.0    2408
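
As an alternative to looking up names through `df.columns`, pandas also offers position-based selection with `iloc`. This is a sketch on a hypothetical frame (`df_demo`), not a replacement for the approach used in this lab; both forms should return the same columns.

```python
import pandas as pd

# Hypothetical frame with three columns
df_demo = pd.DataFrame({
    "Car": ["Vega", "Woody"],
    "MPG": [25.0, 24.5],
    "Weight": [2542, 2164],
})

# Approach used in this lab: positions -> names -> selection
by_name_lookup = df_demo[df_demo.columns[[0, 2]]]

# Position-based selection: all rows, columns at positions 0 and 2
by_iloc = df_demo.iloc[:, [0, 2]]

print(by_iloc)
print(by_name_lookup.equals(by_iloc))
```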




---



# **Guided Lab 343.3.7 - Count the occurrences of unique values from column**

## **Learning Objective:**
In this lab, you will learn how to count the occurrences of unique values in a column of a Pandas DataFrame using the value_counts() function.

By the end of this lab, learners will be able to:
- Use the value_counts() function to count the occurrences of unique values in a column of a Pandas DataFrame.
- Analyze the quick summary of the unique values and their frequencies, making it a valuable tool in data exploration and analysis.







## **Introduction**:
The **`value_counts()`** function in pandas is used to count the occurrences of unique values in a Series (a single column of a DataFrame). It returns a pandas Series with the unique values as the index and the counts of each unique value as the corresponding values in the Series.

Here are some common use cases for value_counts():

## **Example 1: Frequency Analysis:**

To understand the distribution of values in a categorical variable.
Example: Count the number of occurrences of each category in a column representing product categories, customer segments, etc.

In [None]:
# import pandas library to the codespace
import pandas as pd

In [None]:
# Sample DataFrame
data = {'Category': ['Electronics', 'Clothing', 'Electronics', 'Books', 'Books', 'Clothing']}
df = pd.DataFrame(data)

# Count the occurrences of each category
category_counts = df['Category'].value_counts()
print(category_counts)

Category
Electronics    2
Clothing       2
Books          2
Name: count, dtype: int64


## **Example 2: Frequency Analysis:**
The example below shows the frequency of items in the column ‘Brand’ and prints the item that occurs more than once in the same column.

In [None]:
# Create the data source
sales_data = {"Devices":['Laptop','iPhone','LED','LCD','Smart-Phone','Washing-Machine'],
           'Brand':['Lenovo','Apple','Samsung','Samsung','Samsung','Whirpool'],
           'Sales':[1000,2000,4000,2000,1000,4000],
           'Profit':[500,1000,1000,1500,1000,1500],
           'Pices left':[5000,4000,4000,5000,5000,1000]}
# create the pandas dataframe
df = pd.DataFrame(sales_data)
print(df)
# Frequency of items in the column brand
Frequency = (df.Brand.value_counts())
print("Print Frequency:\n", Frequency)
# print the product that occurs more than once
Frequent_product = Frequency[Frequency > 1].index[0]
print("Frequent Products:\n", Frequent_product)
# display the items along with their frequency
display(Frequency)
# print the item that occurs more than once
print(" This item appears more than once:",Frequent_product)

           Devices     Brand  Sales  Profit  Pices left
0           Laptop    Lenovo   1000     500        5000
1           iPhone     Apple   2000    1000        4000
2              LED   Samsung   4000    1000        4000
3              LCD   Samsung   2000    1500        5000
4      Smart-Phone   Samsung   1000    1000        5000
5  Washing-Machine  Whirpool   4000    1500        1000
Print Frequency:
 Brand
Samsung     3
Lenovo      1
Apple       1
Whirpool    1
Name: count, dtype: int64
Frequent Products:
 Samsung


Brand
Samsung     3
Lenovo      1
Apple       1
Whirpool    1
Name: count, dtype: int64

 This item appears more than once: Samsung


## **Example 3: Checking Missing Values:**

To quickly identify missing values in a column.
Example: Count the occurrences of each value in a column, including NaN, and check for any values that stand out, such as 0 or -1, which are sometimes used as placeholders for missing data.

In [None]:
# Sample DataFrame with missing values
data = {'Score': [85, 92, 88, 75, None, 90, None, 85]}
df = pd.DataFrame(data)

# Count the occurrences of each value in the 'Score' column
# dropna=True (the default) excludes NA values from the counts
score_counts = df['Score'].value_counts(dropna=False) # dropna=False includes NA values in the counts
print(score_counts)


Score
85.0    2
NaN     2
92.0    1
88.0    1
75.0    1
90.0    1
Name: count, dtype: int64
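
To cross-check the NaN count, you can compare value_counts() against isna().sum(). This is a minimal sketch on a hypothetical `Score` column, similar to but smaller than the one above.

```python
import pandas as pd

# Hypothetical scores with two missing entries
df_demo = pd.DataFrame({"Score": [85, 92, None, 90, None]})

n_missing = df_demo["Score"].isna().sum()                       # number of NaN entries
counts_default = df_demo["Score"].value_counts()                # NaN excluded
counts_with_nan = df_demo["Score"].value_counts(dropna=False)   # NaN included

print("Missing:", n_missing)
print("Counted by default:", counts_default.sum())
print("Counted with dropna=False:", counts_with_nan.sum())
```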


## **Example 4: Checking Data Quality:**

To quickly identify potential issues with data quality.
Example: Identify unexpected or outlier values in a numerical column

In [None]:
# Sample DataFrame
data = {'Age': [25, 30, 25, 35, 25, 40, 25, 30, 45, 25, 30, 25]}
df = pd.DataFrame(data)

# Count the occurrences of each age
age_counts = df['Age'].value_counts()
age_counts2 = df['Age'].value_counts(normalize=True) # normalize=True returns proportions (fractions that sum to 1)
print(age_counts)
print('Percentage:')
print(age_counts2)


Age
25    6
30    3
35    1
40    1
45    1
Name: count, dtype: int64
Percentage:
Age
25    0.500000
30    0.250000
35    0.083333
40    0.083333
45    0.083333
Name: proportion, dtype: float64
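
Because normalize=True returns fractions rather than percentages, a small extra step is needed if you want values out of 100. The sketch below uses a hypothetical `ages` Series and also shows sort_index(), which orders the result by value instead of by frequency.

```python
import pandas as pd

# Hypothetical ages
ages = pd.Series([25, 30, 25, 35, 25, 30])

# Fractions -> percentages, rounded to one decimal place
pct = ages.value_counts(normalize=True).mul(100).round(1)
print(pct)

# Order the counts by age rather than by frequency
by_age = ages.value_counts().sort_index()
print(by_age)
```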


***

# **Guided Lab - 343.3.8 - How to Convert Pandas Column to List**

## **Learning Objective:**
In this lab, you will learn and demonstrate how to convert a Pandas column from a DataFrame to a list using the **.tolist()** method on the specific column.

Upon completing this lab, you should be able to:

- Extract a specific column from a Pandas DataFrame.
- Convert a Pandas Series (column) into a Python list using various methods.
- Describe the relationship between Pandas Series, NumPy arrays, and Python lists.
- Access columns using both column names and index positions.
- Extract the DataFrame's index as a list.

You can convert a pandas DataFrame column to a list using Series.values.tolist(). Each column in a DataFrame is a Series, so first select the column you want as a Series with **df.column_name** or **df['column_name']**, then call the method on it.

## **Dataset**

In this lab, we will use the [cars.json](https://drive.google.com/file/d/1CXAK8gbuLtc2NNOXVUgmja8fDg0TrNZm/view) dataset. Let's import the cars dataset into a pandas DataFrame.

## **Instructions:**

In [None]:
import pandas as pd

In [None]:
# do not forget to change the path of the file
df_cars = pd.read_json('./Data/cars.json')
df_cars.head(4)

Unnamed: 0,Car,MPG,Cylinders,Displacement,Horsepower,Weight,Acceleration,Model,Origin,quantity,city
0,Chevrolet Vega,25.0,4,140.0,75,2542,17.0,74,US,177,NJ
1,Chevrolet Vega (sw),22.0,4,140.0,72,2408,19.0,71,US,91,DALLAS
2,Chevrolet Vega 2300,28.0,4,140.0,90,2264,15.5,71,US,74,TEXAS
3,Chevrolet Woody,24.5,4,98.0,60,2164,22.1,76,US,241,OH


### **Example 1: Converting Pandas column to list using Series.values.tolist()**

In [None]:
# Example 1: Using Series.values.tolist()
# in the below example, we are getting the column 'Car' of the dataframe
col_car_list = df_cars.Car.values.tolist()
print(col_car_list)

# getting the column 'Model' of the dataframe
col_model_list = df_cars.Model.values.tolist()
print(col_model_list)


['Chevrolet Vega', 'Chevrolet Vega (sw)', 'Chevrolet Vega 2300', 'Chevrolet Woody', 'Chevrolete Chevelle Malibu', 'Chevy C20', 'Chevy S-10', 'Chrysler Cordoba', 'Chrysler Lebaron Medallion', 'Chrysler Lebaron Salon', 'Chrysler Lebaron Town @ Country (sw)', 'Chrysler New Yorker Brougham', 'Chrysler Newport Royal', 'Citroen DS-21 Pallas', 'Datsun 1200', 'Datsun 200SX', 'Datsun 200-SX', 'Datsun 210', 'Datsun 210', 'Datsun 210 MPG', 'Datsun 280-ZX', 'Datsun 310', 'Datsun 310 GX', 'Datsun 510', 'Datsun 510 (sw)', 'Datsun 510 Hatchback', 'Datsun 610', 'Datsun 710', 'Datsun 710', 'Datsun 810', 'Datsun 810 Maxima', 'Datsun B210', 'Datsun B-210', 'Datsun B210 GX', 'Datsun F-10 Hatchback', 'Datsun PL510', 'Datsun PL510', 'Dodge Aries SE', 'Dodge Aries Wagon (sw)', 'Dodge Aspen', 'Dodge Aspen', 'Dodge Aspen 6', 'Dodge Aspen SE', 'Dodge Challenger SE', 'Dodge Charger 2.2', 'Dodge Colt', 'Dodge Colt', 'Dodge Colt', 'Dodge Colt (sw)', 'Dodge Colt Hardtop', 'Dodge Colt Hatchback Custom', 'Dodge Colt 

**Alternatively, you can write the same statement using bracket notation:**

In [None]:
# in the below example, we are getting the 'Car' column as a list using bracket notation
df_cars["Car"].values.tolist()

Below is an explanation of each part of the statement.

- df_cars['Car'] returns a Series object for the specified column.

- df_cars['Car'].values returns a NumPy array with the column values, and the array's .tolist() method converts it to a list.
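
The Series object has its own tolist() method, so the intermediate `.values` step is optional. The sketch below compares three routes on a hypothetical `Model` Series (not loaded from cars.json) and checks that they produce the same list of plain Python values.

```python
import pandas as pd

# Hypothetical stand-in for the 'Model' column
s = pd.Series([74, 71, 76], name="Model")

via_values = s.values.tolist()     # Series -> NumPy array -> list
via_series = s.tolist()            # Series -> list directly
via_numpy = s.to_numpy().tolist()  # to_numpy() is the method form of .values

print(via_values, via_series, via_numpy)
print(type(via_series[0]))  # elements are plain Python ints
```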

# **Example 2: Converting Pandas column to list using list() Function**

list() function will return the list with the values of the specified column of DataFrame.

In [None]:
print(list(df_cars["Car"].head(2)))
print(list(df_cars["Model"].head(2)))

['Chevrolet Vega', 'Chevrolet Vega (sw)']
[74, 71]


# **Example 3: Convert Pandas column to Numpy array**

Sometimes you may need to convert a Pandas column to a NumPy array. You can do so with the **to_numpy()** function.

In [None]:

col_list = df_cars['Car'].head(4).to_numpy()
print(col_list)
print("Get the array from specified column:\n")
print(type(col_list))



['Chevrolet Vega' 'Chevrolet Vega (sw)' 'Chevrolet Vega 2300'
 'Chevrolet Woody']
Get the array from specified column:

<class 'numpy.ndarray'>


# **Example 4: Converting Pandas column to list by Column Index**

If you know a column's position rather than its name, first look up the column name by its index and then use the approaches explained above to convert the column to a list.

In the example below, **`df_cars.columns[0]`** returns the column name at index 0, which is Car.


In [None]:
# look up the column name at position 0, then convert that column's first four values to a list
col_list = df_cars[df_cars.columns[0]].head(4).values.tolist()
print(col_list)
print(type(col_list))



['Chevrolet Vega', 'Chevrolet Vega (sw)', 'Chevrolet Vega 2300', 'Chevrolet Woody']
<class 'list'>
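
As a possible alternative to looking up the name first, a column can also be selected directly by position with `iloc`. This is not the approach used in the lab; the sketch below assumes a small hypothetical frame, `df_demo`, and simply checks that both routes give the same list.

```python
import pandas as pd

# Hypothetical stand-in for df_cars
df_demo = pd.DataFrame({"Car": ["Chevrolet Vega", "Datsun 210"],
                        "Model": [74, 71]})

# Route 1 (as in the lab): position -> column name -> column
first_col_name = df_demo.columns[0]
by_name = df_demo[first_col_name].tolist()

# Route 2 (alternative): select all rows of the column at position 0
by_position = df_demo.iloc[:, 0].tolist()

print(first_col_name)
print(by_name == by_position)
```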


# **Example 5: Convert Index Column to List**


In [None]:
# in the below example, we are getting the index of the dataframe
index_list = df_cars.head().index.tolist()
print(index_list)
print(len(index_list))

[0, 1, 2, 3, 4]
5


***

# **Guided Lab - 343.3.9 - Adding new column to existing DataFrame in Pandas**


## **Lab Objective:**
- In this lab, we will cover how to add one or more columns to a Pandas DataFrame, including columns that hold a constant value and columns derived from existing columns.

- By the end of this lab, learners will be able to add new columns to a Pandas DataFrame using several approaches.

## **Introduction:**
In Pandas, a DataFrame represents a two-dimensional, heterogeneous, tabular data structure with labeled rows and columns (axes). In simple words, it contains three components: data, rows, and columns.

Let's discuss how to add new columns to the existing DataFrame in Pandas. There are multiple ways we can do this task.

## **Instructions**


## **Method #1: By Dictionary and declaring a new list as a column.**

**Example 1.1:**

Note that the length of your list must match the number of rows in the DataFrame (the length of its index); otherwise, Pandas raises a ValueError.
  




In [None]:
import pandas as pd

In [None]:
# Define a dictionary containing Students data and their respective details
data = {'Name': ['Jane', 'Princi', 'James', 'Fadi'],
        'Height': [5.1, 6.2, 5.1, 5.2],
        'Qualification': ['Msc', 'MA', 'Msc', 'Msc'],
        'Score 1' : [56,86,77,45],
        'Score 2' : [50,96,60,30]}

# Convert the dictionary into DataFrame
df = pd.DataFrame(data)
print("------before -------")
print(df)


# Declare a list that is to be converted into a column
address = ['NYC', 'NJ', 'CA', 'PA']

# Using 'Address' as the column name
# and equating it to the list
df['Address'] = address
print("------after adding column -------")
# Observe the result
print(df)

# Declare list to convert to a column
gender = ['F', 'F', 'M', 'F']

# Use 'Gender' as column name & add to df
df['Gender'] = gender
print("------add another column-------")
print(df)

------before -------
     Name  Height Qualification  Score 1  Score 2
0    Jane     5.1           Msc       56       50
1  Princi     6.2            MA       86       96
2   James     5.1           Msc       77       60
3    Fadi     5.2           Msc       45       30
------after adding column -------
     Name  Height Qualification  Score 1  Score 2 Address
0    Jane     5.1           Msc       56       50     NYC
1  Princi     6.2            MA       86       96      NJ
2   James     5.1           Msc       77       60      CA
3    Fadi     5.2           Msc       45       30      PA
------add another column-------
     Name  Height Qualification  Score 1  Score 2 Address Gender
0    Jane     5.1           Msc       56       50     NYC      F
1  Princi     6.2            MA       86       96      NJ      F
2   James     5.1           Msc       77       60      CA      M
3    Fadi     5.2           Msc       45       30      PA      F
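
The length requirement noted at the start of Example 1.1 can be seen by deliberately assigning a list that is too short. This is a minimal sketch with a hypothetical frame `df_demo`; the exact wording of the error message may differ between Pandas versions.

```python
import pandas as pd

df_demo = pd.DataFrame({"Name": ["Jane", "Princi", "James", "Fadi"]})

# Only three values for a four-row DataFrame
short_list = ["NYC", "NJ", "CA"]

mismatch_error = None
try:
    df_demo["Address"] = short_list
except ValueError as err:
    # Pandas refuses the assignment because the lengths differ
    mismatch_error = err
    print("ValueError:", err)
```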


**Example: 1.2**

- You can apply basic arithmetic operations such as addition, subtraction, multiplication, and division to Pandas DataFrame columns the same way you would with NumPy arrays.

- The example below adds a new column calculated from existing columns.





In [None]:
# defining a dictionary containing Students data and their respective details
data = {'Name': ['Jane', 'Princi', 'James', 'Fadi'],
        'Height': [5.1, 6.2, 5.1, 5.2],
        'Qualification': ['Msc', 'MA', 'Msc', 'Msc'],
        'Score 1' : [56,86,77,45],
        'Score 2' : [50,96,60,30]}

# Convert the dictionary into DataFrame
df = pd.DataFrame(data)
print("------before -------")
print(df)
print("------after adding column -------")
df['Total_Score'] = df['Score 1'] +  df['Score 2'] # add columns & store/add in another column

print(df)

------before -------
     Name  Height Qualification  Score 1  Score 2
0    Jane     5.1           Msc       56       50
1  Princi     6.2            MA       86       96
2   James     5.1           Msc       77       60
3    Fadi     5.2           Msc       45       30
------after adding column -------
     Name  Height Qualification  Score 1  Score 2  Total_Score
0    Jane     5.1           Msc       56       50          106
1  Princi     6.2            MA       86       96          182
2   James     5.1           Msc       77       60          137
3    Fadi     5.2           Msc       45       30           75
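
Example 1.2 shows addition only. The other operations mentioned above work the same way, element by element. The sketch below assumes a reduced two-row frame, and the new column names are made up for illustration.

```python
import pandas as pd

df_demo = pd.DataFrame({"Score 1": [56, 86],
                        "Score 2": [50, 96]})

# Each operation is applied row by row, as with NumPy arrays
df_demo["Score_Diff"] = df_demo["Score 1"] - df_demo["Score 2"]
df_demo["Score_Product"] = df_demo["Score 1"] * df_demo["Score 2"]
df_demo["Score_Ratio"] = df_demo["Score 1"] / df_demo["Score 2"]

print(df_demo)
```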


**Example: 1.3**

In [None]:
data = {'Name': ['Jane', 'Princi', 'James', 'Fadi'],
        'Height': [5.1, 6.2, 5.1, 5.2],
        'Qualification': ['Msc', 'MA', 'Msc', 'Msc'],
        'Score 1' : [56,86,77,45],
        'Score 2' : [50,96,60,30]}

# Convert the dictionary into DataFrame
df = pd.DataFrame(data)
print("------before -------")
print(df)
print("------after adding column -------")
# Sum the two scores, then derive the average from the total
df['Total_Score'] = df['Score 1'] +  df['Score 2']
df['Total_Score_average'] = df['Total_Score']/ 2
print(df)
df


------before -------
     Name  Height Qualification  Score 1  Score 2
0    Jane     5.1           Msc       56       50
1  Princi     6.2            MA       86       96
2   James     5.1           Msc       77       60
3    Fadi     5.2           Msc       45       30
------after adding column -------
     Name  Height Qualification  Score 1  Score 2  Total_Score  \
0    Jane     5.1           Msc       56       50          106   
1  Princi     6.2            MA       86       96          182   
2   James     5.1           Msc       77       60          137   
3    Fadi     5.2           Msc       45       30           75   

   Total_Score_average  
0                 53.0  
1                 91.0  
2                 68.5  
3                 37.5  


     Name  Height Qualification  Score 1  Score 2  Total_Score  Total_Score_average
0    Jane     5.1           Msc       56       50          106                 53.0
1  Princi     6.2            MA       86       96          182                 91.0
2   James     5.1           Msc       77       60          137                 68.5
3    Fadi     5.2           Msc       45       30           75                 37.5


**Example: 1.4**

In the example below, we rearrange the columns and skip the Height column by passing the desired column order to the `columns` argument of `pd.DataFrame`.



In [None]:
data = {'Name': ['Jane', 'Princi', 'James', 'Fadi'],
        'Height': [5.1, 6.2, 5.1, 5.2],
        'Qualification': ['Msc', 'MA', 'Msc', 'Msc'],
        'Score 1' : [56,86,77,45],
        'Score 2' : [50,96,60,30]}

# Convert the dictionary into DataFrame / skipping 'Height' column
df = pd.DataFrame(data,columns = ['Score 1','Score 2','Name', 'Qualification'  ])
print("------before -------")
print(df)
print("------after adding column -------")

df['Total_Score'] = df['Score 1'] +  df['Score 2']
print(df)


------before -------
   Score 1  Score 2    Name Qualification
0       56       50    Jane           Msc
1       86       96  Princi            MA
2       77       60   James           Msc
3       45       30    Fadi           Msc
------after adding column -------
   Score 1  Score 2    Name Qualification  Total_Score
0       56       50    Jane           Msc          106
1       86       96  Princi            MA          182
2       77       60   James           Msc          137
3       45       30    Fadi           Msc           75


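**Example: 1.5**

In the example below, we derive a Percentage column from Univ_Marks, assuming the marks are out of 500.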


In [None]:
# creating and initializing a list
values = [['Rohan', 455], ['Elvish', 250], ['John', 495],
          ['Sai', 400], ['Eric', 350], ['Adam', 450]]

# creating a pandas dataframe
df = pd.DataFrame(values, columns=['Name', 'Univ_Marks'])

# displaying the data frame

print('Data frame before calculating percentage\n')
print(df)
print('\nData frame with Percentage Column\n')

# Creating new column Percentage derived from Univ_Marks
df["Percentage"] = df["Univ_Marks"]/500*100

# displaying the data frame
print(df)

Data frame before calculating percentage

     Name  Univ_Marks
0   Rohan         455
1  Elvish         250
2    John         495
3     Sai         400
4    Eric         350
5    Adam         450

Data frame with Percentage Column

     Name  Univ_Marks  Percentage
0   Rohan         455        91.0
1  Elvish         250        50.0
2    John         495        99.0
3     Sai         400        80.0
4    Eric         350        70.0
5    Adam         450        90.0
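
The lab objective also mentions adding a constant value. Assigning a single scalar to a new column name fills every row with that value. The sketch below assumes a shortened version of the Example 1.5 data, and the `Max_Marks` column name is hypothetical.

```python
import pandas as pd

df_demo = pd.DataFrame({"Name": ["Rohan", "Elvish", "John"],
                        "Univ_Marks": [455, 250, 495]})

# A scalar is broadcast to every row of the new column
df_demo["Max_Marks"] = 500

# The constant column can then be used like any other column
df_demo["Percentage"] = df_demo["Univ_Marks"] / df_demo["Max_Marks"] * 100

print(df_demo)
```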


## **Method #2: By using DataFrame.insert() function**

DataFrame.insert() lets you add a column at any position you like, not just at the end. The column values can be given as a scalar, a Series, or an array-like such as a list.

**Example 2.1**




In [None]:
# Define a dictionary containing Students data
data = {'Name': ['Jane', 'Princi', 'James', 'Fadi'],
        'Height': [5.1, 6.2, 5.1, 5.2],
        'Qualification': ['Msc', 'MA', 'Msc', 'Msc'],
        'Score 1' : [56,86,77,45],
        'Score 2' : [50,96,60,30]}

# Convert the dictionary into DataFrame
df = pd.DataFrame(data)
print("------before -------")
print(df)

# Use DataFrame.insert() to add an 'Age' column at position 2
# arguments: loc (position), column (name), value (contents), allow_duplicates
df.insert(2, "Age", [21, 23, 24, 21], True)
print("------after adding column -------")
# Observe the result
print(df)

------before -------
     Name  Height Qualification  Score 1  Score 2
0    Jane     5.1           Msc       56       50
1  Princi     6.2            MA       86       96
2   James     5.1           Msc       77       60
3    Fadi     5.2           Msc       45       30
------after adding column -------
     Name  Height  Age Qualification  Score 1  Score 2
0    Jane     5.1   21           Msc       56       50
1  Princi     6.2   23            MA       86       96
2   James     5.1   24           Msc       77       60
3    Fadi     5.2   21           Msc       45       30
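
Two further behaviours of `insert()` are worth trying out: a scalar value is repeated for every row, and inserting a column name that already exists raises a ValueError unless `allow_duplicates` is set to True. The sketch below assumes a hypothetical two-row frame, and the `Batch` column and its value are made up for illustration.

```python
import pandas as pd

df_demo = pd.DataFrame({"Name": ["Jane", "Princi"],
                        "Height": [5.1, 6.2]})

# loc=0 places the new column first; the scalar fills every row
df_demo.insert(0, "Batch", "2024-A")
print(list(df_demo.columns))

duplicate_error = None
try:
    # 'Name' already exists and allow_duplicates defaults to False
    df_demo.insert(1, "Name", ["X", "Y"])
except ValueError as err:
    duplicate_error = err
    print("ValueError:", err)
```

Unlike `assign()`, `insert()` modifies the DataFrame in place and returns None, so its result should not be assigned back to `df_demo`.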


## **Method #3: Using DataFrame.assign() function**

**Example 3.1**

This method will create a new dataframe with a new column added to the old dataframe.

**assign() returns a new object with all original columns in addition to the new ones. Existing columns that are re-assigned will be overwritten.**

In [None]:
# Define a dictionary containing Students data
data = {'Name': ['Jane', 'Princi', 'James', 'Fadi'],
        'Height': [5.1, 6.2, 5.1, 5.2],
        'Qualification': ['Msc', 'MA', 'Msc', 'Msc'],
        'Score 1' : [56,86,77,45],
        'Score 2' : [50,96,60,30]}

print("------before -------")

# Convert the dictionary into DataFrame
df = pd.DataFrame(data)
print(df)
print("------after adding column -------")
# Use DataFrame.assign() to add a column named 'address'; assign() returns a new DataFrame, so store the result back in df
df = df.assign(address = ['NYC', 'NJ', 'CA', 'PA'])

# Observe the result
print(df)

------before -------
     Name  Height Qualification  Score 1  Score 2
0    Jane     5.1           Msc       56       50
1  Princi     6.2            MA       86       96
2   James     5.1           Msc       77       60
3    Fadi     5.2           Msc       45       30
------after adding column -------
     Name  Height Qualification  Score 1  Score 2 address
0    Jane     5.1           Msc       56       50     NYC
1  Princi     6.2            MA       86       96      NJ
2   James     5.1           Msc       77       60      CA
3    Fadi     5.2           Msc       45       30      PA


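**Example 3.2**

The value passed to assign() can also be a callable. Pandas calls it with the DataFrame and uses the result as the new column.

In [None]: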
df_temp = pd.DataFrame({'temp_c': [17.0, 25.0]},
                        index=['South Carolina', 'Florida'])
print(df_temp)
print('\nAssigning column w/ Lambda:')
# value is a callable, evaluated on df using lambda
print(df_temp.assign(temp_f = lambda x: x.temp_c * 9 / 5 + 32))

                temp_c
South Carolina    17.0
Florida           25.0

Assigning column w/ Lambda:
                temp_c  temp_f
South Carolina    17.0    62.6
Florida           25.0    77.0
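
The points made about `assign()` above (it returns a new object, and re-assigned columns are overwritten) can be checked with a short sketch. It reuses the temperature data from the previous cell under the hypothetical name `df_demo`; the `temp_k` column and the replacement values are made up for illustration.

```python
import pandas as pd

df_demo = pd.DataFrame({"temp_c": [17.0, 25.0]},
                       index=["South Carolina", "Florida"])

# Several columns can be added in one call, one keyword argument per column
df_new = df_demo.assign(
    temp_f=lambda x: x.temp_c * 9 / 5 + 32,
    temp_k=lambda x: x.temp_c + 273.15,
)

# Re-assigning an existing column name overwrites it in the returned copy
df_over = df_demo.assign(temp_c=[18.0, 26.0])

print(list(df_demo.columns))   # the original frame is unchanged
print(list(df_new.columns))
print(df_over)
```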


***