# This is Assignment 11 - Thienkim Le
## Requirement 3:
In your own words, describe the commonalities and differences between
NumPy arrays and Pandas series.

Commonalities:
* Both are Python data structures backed by compiled code (a Pandas Series is built on top of a NumPy array), providing efficient performance for numerical tasks.
* Support indexing and slicing operations for accessing and modifying data.
* Allow various mathematical and statistical operations.

Differences:

* NumPy arrays can have multiple dimensions, while Pandas Series are strictly one-dimensional.
* NumPy arrays are normally homogeneous; a Pandas Series also has a single dtype, but it readily accommodates mixed values by falling back to the generic object dtype (at the cost of speed).
* NumPy arrays use positional integer-based indexing, while Pandas Series allow custom labeled indexes for intuitive access.
* Pandas Series offer tailored functionality for data analysis, such as handling missing data and time series.
* Pandas Series provide more flexibility with built-in methods for tasks like grouping and filtering.
* NumPy arrays are foundational for scientific computing, while Pandas Series are often used alongside DataFrames for comprehensive data analysis.

In summary, NumPy arrays excel in numerical computations and multi-dimensional data, while Pandas Series are optimized for labeled one-dimensional data analysis. Choosing between them depends on the specific task requirements.

Below are examples of creating NumPy arrays and Pandas Series, along with some basic operations:

NumPy Example:

In [8]:
import numpy as np

# Creating a NumPy array with multiple dimensions
numpy_array = np.array([[1, 2, 3], [4, 5, 6]])

print("NumPy Array:")
print(numpy_array)

# NumPy array indexing
print("NumPy Array - Accessing element at position (1, 2):", numpy_array[1, 2])

# NumPy array for numerical computations
numpy_array = np.array([1, 2, 3, 4, 5])

print("Sum of elements in NumPy array:", np.sum(numpy_array))
print("Mean of elements in NumPy array:", np.mean(numpy_array))

NumPy Array:
[[1 2 3]
 [4 5 6]]
NumPy Array - Accessing element at position (1, 2): 6
Sum of elements in NumPy array: 15
Mean of elements in NumPy array: 3.0


Pandas Example:

In [11]:
import pandas as pd
import numpy as np

# Creating a Pandas Series with mixed data types
pandas_series = pd.Series([1, 'two', 3.0, True])

print("Pandas Series:")
print(list(pandas_series))

# Pandas Series with custom labeled index
pandas_series = pd.Series([1, 2, 3, 4], index=['a', 'b', 'c', 'd'])

print("Pandas Series - Accessing element with label 'b':", pandas_series['b'])

# Handling missing data in Pandas Series
pandas_series_with_missing = pd.Series([1, 2, np.nan, 4])

print("Pandas Series with missing data:")
print(list(pandas_series_with_missing))

# Handling time series data in Pandas Series
time_index = pd.date_range('2024-01-01', periods=5)
pandas_time_series = pd.Series([1, 2, 3, 4, 5], index=time_index)

print("Pandas Time Series:")
print(list(pandas_time_series))

# Grouping data in Pandas Series
# (a function passed to groupby is applied to the index labels 0-9, not to the values)
pandas_series_grouped = pd.Series([1, 2, 3, 4, 5, 6, 7, 8, 9, 10])
grouped = pandas_series_grouped.groupby(lambda x: 'Even' if x % 2 == 0 else 'Odd').sum()

print("Grouped Pandas Series:")
print(list(grouped))

# Filtering data in Pandas Series
filtered_series = pandas_series_grouped[pandas_series_grouped > 5]

print("Filtered Pandas Series (values greater than 5):")
print(list(filtered_series))

# Creating a Pandas DataFrame
data = {'Name': ['Alice', 'Bob', 'Charlie'],
        'Age': [25, 30, 35]}
df = pd.DataFrame(data)

print("Pandas DataFrame:")
print(df)

# Performing data analysis with Pandas DataFrame
print("Mean age:", df['Age'].mean())


Pandas Series:
[1, 'two', 3.0, True]
Pandas Series - Accessing element with label 'b': 2
Pandas Series with missing data:
[1.0, 2.0, nan, 4.0]
Pandas Time Series:
[1, 2, 3, 4, 5]
Grouped Pandas Series:
[25, 30]
Filtered Pandas Series (values greater than 5):
[6, 7, 8, 9, 10]
Pandas DataFrame:
      Name  Age
0    Alice   25
1      Bob   30
2  Charlie   35
Mean age: 30.0
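
The dtype and indexing differences described above can be checked directly. The sketch below is illustrative only (the variable names are not part of the assignment): it shows that a mixed-value Series uses the generic `object` dtype, and that a Series adds by matching index labels while a NumPy array adds by position.

```python
import numpy as np
import pandas as pd

# A Series with mixed values is stored with the generic object dtype
mixed = pd.Series([1, 'two', 3.0, True])
print(mixed.dtype)            # object

# A numeric Series wraps a NumPy array, available through .to_numpy()
numbers = pd.Series([1, 2, 3], index=['a', 'b', 'c'])
print(type(numbers.to_numpy()))

# Same labels, listed in a different order
reordered = pd.Series([10, 20, 30], index=['c', 'b', 'a'])

# NumPy adds by position; Pandas adds by matching index labels
by_position = numbers.to_numpy() + reordered.to_numpy()
by_label = numbers + reordered

print(by_position)            # [11 22 33]
print(by_label)               # a -> 31, b -> 22, c -> 13
```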


## Requirement 4
Create a Pandas series named bill_names with American currency bill
denominations as indices (keys) and President last names as values. For
example: 1 – Washington, 2 – Jefferson, etc.

In [13]:
import pandas as pd

# Create a dictionary with American currency bill denominations as keys and the last names of the people pictured as values
# (Hamilton and Franklin appear on bills but were not presidents)
bill_names_data = {
    1: 'Washington',
    2: 'Jefferson',
    5: 'Lincoln',
    10: 'Hamilton',
    20: 'Jackson',
    50: 'Grant',
    100: 'Franklin'
}

# Create the Pandas Series
bill_names = pd.Series(bill_names_data)

# Print the Pandas Series
print("Pandas Series - bill_names:")
print(list(bill_names))


Pandas Series - bill_names:
['Washington', 'Jefferson', 'Lincoln', 'Hamilton', 'Jackson', 'Grant', 'Franklin']


## Requirement 5
Create a Pandas series-as-dictionary beverages_dict. Make the indices
beverage names and comments about the beverage as values. For example:
‘Aquafina’: ‘This is my favorite bottled water!’. Demonstrate accessing:
* individual values via the keys
* multiple values via slicing

In [14]:
import pandas as pd

# Create a dictionary with beverage names as keys and comments as values
beverages_data = {
    'Aquafina': 'This is my favorite bottled water!',
    'Coca-Cola': 'Classic soda with a refreshing taste.',
    'Coffee': 'Helps me wake up in the morning.',
    'Orange Juice': 'Freshly squeezed orange juice is the best!',
    'Green Tea': 'A calming and healthy beverage option.',
    'Milk': 'Great source of calcium for strong bones.'
}

# Create the Pandas Series
beverages_dict = pd.Series(beverages_data)

# Accessing individual values via keys
print("Comment about Aquafina:", beverages_dict['Aquafina'])
print("Comment about Coffee:", beverages_dict['Coffee'])

# Accessing multiple values via slicing
print("\nComments about the first three beverages:")
print(beverages_dict[:3])


Comment about Aquafina: This is my favorite bottled water!
Comment about Coffee: Helps me wake up in the morning.

Comments about the first three beverages:
Aquafina        This is my favorite bottled water!
Coca-Cola    Classic soda with a refreshing taste.
Coffee            Helps me wake up in the morning.
dtype: object
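
Slicing can also be done with the beverage names themselves. The following is a minimal sketch, assuming a shortened copy of the data above (the name `beverages` is used only for this example): a label-based slice with `.loc` includes both endpoints, while a position-based slice with `.iloc` excludes the stop position.

```python
import pandas as pd

# A shortened copy of the beverages data, used only for illustration
beverages = pd.Series({
    'Aquafina': 'This is my favorite bottled water!',
    'Coca-Cola': 'Classic soda with a refreshing taste.',
    'Coffee': 'Helps me wake up in the morning.',
    'Milk': 'Great source of calcium for strong bones.'
})

# Label-based slice: both endpoints are included
by_label = beverages.loc['Aquafina':'Coffee']

# Position-based slice: the stop position is excluded
by_position = beverages.iloc[0:3]

print(list(by_label.index))
print(by_label.equals(by_position))
```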


## Requirement 6
Obtain the real city proper population data for the following cities:
Chongqing, Shanghai, Tokyo, Moscow, Mexico City, London, & New York.
Create a series-as-dictionary named population based on this information.

In [15]:
import pandas as pd

# Create a dictionary with city names as keys and population as values
population_data = {
    'Chongqing': 30000000,
    'Shanghai': 27000000,
    'Tokyo': 38000000,
    'Moscow': 12000000,
    'Mexico City': 21000000,
    'London': 9000000,
    'New York': 20000000
}

# Create the Pandas Series
population = pd.Series(population_data)

# Print the Pandas Series
print("Population Series:")
print(population)


Population Series:
Chongqing      30000000
Shanghai       27000000
Tokyo          38000000
Moscow         12000000
Mexico City    21000000
London          9000000
New York       20000000
dtype: int64


## Requirement 7
Create a Pandas series-as-dictionary named city_country with the cities in
population and the countries for each city.

In [16]:
import pandas as pd

# Dictionary containing city-country mappings
city_country_data = {
    'Chongqing': 'China',
    'Shanghai': 'China',
    'Tokyo': 'Japan',
    'Moscow': 'Russia',
    'Mexico City': 'Mexico',
    'London': 'United Kingdom',
    'New York': 'United States'
}

# Create the Pandas Series
city_country = pd.Series(city_country_data)

# Print the Pandas Series
print("City-Country Series:")
print(city_country)


City-Country Series:
Chongqing               China
Shanghai                China
Tokyo                   Japan
Moscow                 Russia
Mexico City            Mexico
London         United Kingdom
New York        United States
dtype: object


## Requirement 8
Create a Pandas dataframe object named city_dataframe from a dictionary
of series objects using population and city_country. Show:
* the .index property
* the .columns property
* the .keys() method

In [20]:
import pandas as pd

# Dictionary of Series objects
data = {
    'Population': population,
    'Country': city_country
}

# Create the Pandas DataFrame
city_dataframe = pd.DataFrame(data)

# Show the .index property
print("Index of the DataFrame:")
print(list(city_dataframe.index))

# Show the .columns property
print("\nColumns of the DataFrame:")
print(list(city_dataframe.columns))

# Show the .keys() method
print("\nKeys of the DataFrame:")
print(list(city_dataframe.keys()))


Index of the DataFrame:
['Chongqing', 'Shanghai', 'Tokyo', 'Moscow', 'Mexico City', 'London', 'New York']

Columns of the DataFrame:
['Population', 'Country']

Keys of the DataFrame:
['Population', 'Country']


## Requirement 9
Create a Pandas series object named my_pd_series from a collection of
string keys and collection of numeric values of your choosing. Demonstrate:
* modifying a value based on key
* slicing by explicit index
* slicing by implicit integer index
* masking
* fancy indexing
* loc[]
* iloc[]


In [21]:
import pandas as pd

# Create a Pandas Series
keys = ['A', 'B', 'C', 'D', 'E']
values = [10, 20, 30, 40, 50]
my_pd_series = pd.Series(values, index=keys)

# Modify a value based on key
my_pd_series['C'] = 35

# Slicing by explicit index
print("Slicing by explicit index:")
print(my_pd_series['B':'D'])  # Slicing from 'B' to 'D', inclusive

# Slicing by implicit integer index
print("\nSlicing by implicit integer index:")
print(my_pd_series[1:3])  # Slicing from index 1 to 2, excluding 3

# Masking
print("\nMasking:")
print(my_pd_series[my_pd_series > 30])  # Selecting values greater than 30

# Fancy indexing
print("\nFancy indexing:")
print(my_pd_series[['A', 'C', 'E']])  # Selecting values with keys 'A', 'C', and 'E'

# Using loc[]
print("\nUsing loc[]:")
print(my_pd_series.loc['B'])  # Accessing value using label 'B'

# Using iloc[]
print("\nUsing iloc[]:")
print(my_pd_series.iloc[3])  # Accessing value using integer index 3



Slicing by explicit index:
B    20
C    35
D    40
dtype: int64

Slicing by implicit integer index:
B    20
C    35
dtype: int64

Masking:
C    35
D    40
E    50
dtype: int64

Fancy indexing:
A    10
C    35
E    50
dtype: int64

Using loc[]:
20

Using iloc[]:
40


## Requirement 10
Using city_dataframe, demonstrate:
* access column via dictionary-style indexing of the column name
* access column via column names that are strings
* add a new column to the city_dataframe named altitude. Hint: See
this example for adding columns:

In [22]:
area = pd.Series({'California': 423967, 'Texas': 695662,
                   'New York': 141297, 'Florida': 170312,
                   'Illinois': 149995})

pop = pd.Series({'California': 38332521, 'Texas': 26448193,
                 'New York': 19651127, 'Florida': 19552860,
                 'Illinois': 12882135})

data = pd.DataFrame({'area': area, 'pop': pop})
data

              area       pop
California  423967  38332521
Texas       695662  26448193
New York    141297  19651127
Florida     170312  19552860
Illinois    149995  12882135


In [24]:
# Access column via dictionary-style indexing of the column name
area_column = data['area']
print("Access column 'area' via dictionary-style indexing:")
print(area_column)

# Access column via column names that are strings (attribute-style access).
# data.pop would return the built-in DataFrame.pop method rather than the
# 'pop' column, so attribute access is shown with 'area'; use data['pop']
# to read the 'pop' column.
area_attribute = data.area
print("\nAccess column 'area' via column names as strings:")
print(area_attribute)

# Add a new column named 'altitude' to the example DataFrame
altitude_data = pd.Series({'California': 100, 'Texas': 200, 'New York': 50, 'Florida': 150, 'Illinois': 75})
data['altitude'] = altitude_data

print("\nDataFrame with new column 'altitude':")
print(data)


Access column 'area' via dictionary-style indexing:
California    423967
Texas         695662
New York      141297
Florida       170312
Illinois      149995
Name: area, dtype: int64

Access column 'area' via column names as strings:
California    423967
Texas         695662
New York      141297
Florida       170312
Illinois      149995
Name: area, dtype: int64

DataFrame with new column 'altitude':
              area       pop  altitude
California  423967  38332521       100
Texas       695662  26448193       200
New York    141297  19651127        50
Florida     170312  19552860       150
Illinois    149995  12882135        75
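
The same three steps can be applied to `city_dataframe` itself. This is only a sketch: it rebuilds a three-city version of the frame from values used earlier in this notebook, and it fills the new `altitude` column with NaN as a placeholder because no altitude data is given here.

```python
import numpy as np
import pandas as pd

# Rebuild a small city_dataframe from values used earlier in this notebook
population = pd.Series({'Tokyo': 38000000, 'Moscow': 12000000, 'London': 9000000})
city_country = pd.Series({'Tokyo': 'Japan', 'Moscow': 'Russia', 'London': 'United Kingdom'})
city_dataframe = pd.DataFrame({'Population': population, 'Country': city_country})

# Dictionary-style access
print(city_dataframe['Population'])

# Attribute-style access (works because 'Country' is a valid identifier
# that does not clash with a DataFrame method)
print(city_dataframe.Country)

# New column; NaN is a placeholder until real altitude values are looked up
city_dataframe['altitude'] = np.nan
print(city_dataframe)
```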


## Requirement 11
Create two Pandas series that when added using ‘+’ produce some NaN
entries. Use the .add() method and a fill_value to replace the NaN entries.

In [25]:
import pandas as pd

# Create the first Pandas Series
series1 = pd.Series([1, 2, 3, 4], index=['a', 'b', 'c', 'd'])

# Create the second Pandas Series
series2 = pd.Series([5, 6, 7], index=['b', 'c', 'd'])

# Add the two series using the '+' operator
result = series1 + series2

print("Result of addition with NaN entries:")
print(result)

# Use the .add() method with fill_value to replace NaN entries
filled_result = series1.add(series2, fill_value=0)

print("\nFilled result using .add() with fill_value:")
print(filled_result)


Result of addition with NaN entries:
a     NaN
b     7.0
c     9.0
d    11.0
dtype: float64

Filled result using .add() with fill_value:
a     1.0
b     7.0
c     9.0
d    11.0
dtype: float64


## Requirement 12
Create two Pandas dataframes that when added using ‘+’ produce some
NaN entries. Use the .add() method and a fill_value of the mean of one of
the dataframes to replace the NaN entries.

In [30]:
import pandas as pd
import numpy as np

# Create the first Pandas DataFrame
df1 = pd.DataFrame({'A': [1, 2, 3], 'B': [4, 5, 6]}, index=['X', 'Y', 'Z'])

# Create the second Pandas DataFrame
df2 = pd.DataFrame({'B': [7, 8, 9], 'C': [10, 11, 12]}, index=['Y', 'Z', 'W'])

# Add the two DataFrames using the '+' operator
result = df1 + df2

print("Result of addition with NaN entries:")
print(result)

# Calculate the mean of all values in df1
mean_df1 = df1.to_numpy().mean()

print("\nMean of all values in df1:", mean_df1)

# Add df1 and df2, using the mean of df1 wherever only one frame has a value
# (cells missing from both frames stay NaN)
filled_result = df1.add(df2, fill_value=mean_df1)

print("\nFilled result using .add() with fill_value as the mean of df1:")
print(filled_result)

Result of addition with NaN entries:
    A     B   C
W NaN   NaN NaN
X NaN   NaN NaN
Y NaN  12.0 NaN
Z NaN  14.0 NaN

Mean of all values in df1: 3.5

Filled result using .add() with fill_value as the mean of df1:
     A     B     C
W  NaN  12.5  15.5
X  4.5   7.5   NaN
Y  5.5  12.0  13.5
Z  6.5  14.0  14.5


## Requirement 13
Create a two-dimensional NumPy array using:
* A = rng.randint(5, 10, size=(4, 4))
* Demonstrate: subtracting row 0 of A from A

In [34]:
import numpy as np

# Create a random number generator
rng = np.random.default_rng()

# Create a 4x4 NumPy array with random integers from 5 up to (but not including) 10
A = rng.integers(5, 10, size=(4, 4))

# Display the original array A
print("Original Array A:")
print(A)

# Subtract row 0 of A from A
result = A - A[0]

# Display the result
print("\nResult after subtracting row 0 of A from A:")
print(result)



Original Array A:
[[8 6 7 8]
 [6 5 6 9]
 [7 9 5 8]
 [8 5 6 9]]

Result after subtracting row 0 of A from A:
[[ 0  0  0  0]
 [-2 -1 -1  1]
 [-1  3 -2  0]
 [ 0 -1 -1  1]]
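
The subtraction above works through NumPy broadcasting. The sketch below assumes a fixed seed (an arbitrary choice, not part of the assignment) so the array is repeatable, and it also shows the column-wise counterpart for comparison.

```python
import numpy as np

# A fixed seed (an arbitrary choice) makes the random array repeatable
rng = np.random.default_rng(seed=0)
A = rng.integers(5, 10, size=(4, 4))

# A[0] has shape (4,), so it is broadcast across every row of A
row_diff = A - A[0]

# To subtract column 0 instead, keep it two-dimensional with shape (4, 1)
col_diff = A - A[:, [0]]

print(row_diff[0])     # first row is all zeros
print(col_diff[:, 0])  # first column is all zeros
```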


## Requirement 14
Create a Pandas dataframe using:
* df = pd.DataFrame(A, columns=list('QRST'))
* Demonstrate: subtracting row 1 of df from df using df.iloc[1]


In [35]:
import pandas as pd
import numpy as np

# Define A explicitly so the output is reproducible
# (the random A from Requirement 13 changes on every run)
A = np.array([[5, 6, 7, 8],
              [9, 7, 6, 5],
              [8, 9, 6, 7],
              [5, 6, 7, 8]])

# Create a Pandas DataFrame using the NumPy array A
df = pd.DataFrame(A, columns=list('QRST'))

# Display the DataFrame df
print("DataFrame df:")
print(df)

# Subtract row 1 of df from df itself
result = df - df.iloc[1]

# Display the result
print("\nResult after subtracting row 1 of df from df:")
print(result)


DataFrame df:
   Q  R  S  T
0  5  6  7  8
1  9  7  6  5
2  8  9  6  7
3  5  6  7  8

Result after subtracting row 1 of df from df:
   Q  R  S  T
0 -4 -1  1  3
1  0  0  0  0
2 -1  2  0  2
3 -4 -1  1  3
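
By default a DataFrame subtracts a Series row-wise, matching the Series labels to the column names. To operate column-wise instead, the `axis` keyword of `.subtract()` can be used. A minimal sketch with the same array (the name `col_result` is only for this example):

```python
import numpy as np
import pandas as pd

A = np.array([[5, 6, 7, 8],
              [9, 7, 6, 5],
              [8, 9, 6, 7],
              [5, 6, 7, 8]])
df = pd.DataFrame(A, columns=list('QRST'))

# Subtract column 'R' from every column by aligning on the row index
col_result = df.subtract(df['R'], axis=0)
print(col_result)
```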


## Requirement 15
Compare and contrast the two sentinel values Pandas uses to represent
missing data.

In [36]:
import pandas as pd
import numpy as np

# Creating a Pandas Series with NaN values
data = [1, 2, np.nan, 4, np.nan, 6]
series_with_nan = pd.Series(data)

# Checking for NaN values
print("NaN values in the Series:")
print(series_with_nan.isna())

# Filling NaN values with a specific value
filled_series = series_with_nan.fillna(0)
print("\nSeries with NaN values filled with 0:")
print(filled_series)


NaN values in the Series:
0    False
1    False
2     True
3    False
4     True
5    False
dtype: bool

Series with NaN values filled with 0:
0    1.0
1    2.0
2    0.0
3    4.0
4    0.0
5    6.0
dtype: float64


In [37]:
import pandas as pd

# Creating a Pandas DataFrame with None values
data = {'A': [1, 2, None, 4, None],
        'B': [None, 5, 6, None, 8]}
df_with_none = pd.DataFrame(data)

# Checking for None values
print("None values in the DataFrame:")
print(df_with_none.isna())

# Filling None values with a specific value
filled_df = df_with_none.fillna(0)
print("\nDataFrame with None values filled with 0:")
print(filled_df)


None values in the DataFrame:
       A      B
0  False   True
1  False  False
2   True  False
3  False   True
4   True  False

DataFrame with None values filled with 0:
     A    B
0  1.0  0.0
1  2.0  5.0
2  0.0  6.0
3  4.0  0.0
4  0.0  8.0
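
The two cells above show that Pandas detects and fills both sentinels the same way. The contrast between them is easier to see at the NumPy level. The sketch below (variable names chosen for illustration) shows that `None` forces the slow `object` dtype and breaks arithmetic, while `NaN` is an ordinary floating-point value that propagates through calculations.

```python
import numpy as np
import pandas as pd

# None is a Python object, so an array holding it gets dtype=object
with_none = np.array([1, None, 3])
print(with_none.dtype)            # object

# NaN is a floating-point value, so the array keeps a native float type
with_nan = np.array([1, np.nan, 3])
print(with_nan.dtype)             # float64

# Arithmetic with NaN propagates NaN; arithmetic with None raises an error
print(np.isnan(with_nan.sum()))   # True
try:
    with_none.sum()
except TypeError as exc:
    print("TypeError:", exc)

# Pandas treats both as missing and converts None to NaN in numeric data
s = pd.Series([1, None, np.nan])
print(s.dtype)                    # float64
print(s.isna().tolist())          # [False, True, True]
```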


## Requirement 16
Demonstrate the %timeit difference between operations using Python
objects and Python integers

In [40]:
# Define two Python objects
obj1 = object()
obj2 = object()

# Define two Python integers
int1 = 10
int2 = 20

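# Time the creation of a generic Python object
%timeit object()

# Time an operation on Python integers
# (CPython folds the constant expression 10 + 20 at compile time,
# so this mostly measures the overhead of the timing loop)
%timeit 10 + 20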


71 ns ± 1.74 ns per loop (mean ± std. dev. of 7 runs, 10,000,000 loops each)
8.32 ns ± 0.532 ns per loop (mean ± std. dev. of 7 runs, 100,000,000 loops each)
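
Another way to show the difference is to sum the same numbers stored once as Python objects and once as native integers. This is a sketch that uses the standard `timeit` module instead of the `%timeit` magic so it also runs outside a notebook; the array length and repeat count are arbitrary choices, and the timings will vary by machine. The object version is typically much slower.

```python
import timeit
import numpy as np

n = 100_000  # array length, chosen arbitrarily for this sketch

# Same values, stored as Python objects versus native 64-bit integers
obj_array = np.arange(n, dtype=object)
int_array = np.arange(n, dtype=np.int64)

# Time 20 calls of .sum() on each array
obj_time = timeit.timeit(obj_array.sum, number=20)
int_time = timeit.timeit(int_array.sum, number=20)

print(f"dtype=object: {obj_time:.4f} s for 20 sums")
print(f"dtype=int64:  {int_time:.4f} s for 20 sums")
```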


## Requirement 17
Create a Pandas series containing null data. Use .isnull() to identify the
entries that are null.

In [42]:
import pandas as pd

# Create a Pandas Series with null data
data = [1, None, 3, None, 5]
series_with_null = pd.Series(data)

# Identify the null entries using .isnull()
null_entries = series_with_null.isnull()

# Display the original series
print("Original Series:")
print(series_with_null)

# Display the null entries
print("\nNull Entries:")
print(null_entries)


Original Series:
0    1.0
1    NaN
2    3.0
3    NaN
4    5.0
dtype: float64

Null Entries:
0    False
1     True
2    False
3     True
4    False
dtype: bool


## Requirement 18
Create a Pandas dataframe containing null values. Demonstrate:
* drop all rows containing a null value
* drop all columns containing a null value
* drop only rows that contain all null values
* drop only columns that contain all null values
* replacing null with 0
* forward fill
* backward fill 

In [43]:
import pandas as pd
import numpy as np

# Create a Pandas DataFrame with null values
data = {
    'A': [1, 2, None, 4, 5],
    'B': [None, 6, 7, None, 9],
    'C': [10, 11, 12, 13, None]
}
df = pd.DataFrame(data)

# Drop all rows containing a null value
df_drop_rows = df.dropna()

# Drop all columns containing a null value
df_drop_columns = df.dropna(axis=1)

# Drop only rows that contain all null values
df_drop_rows_all_null = df.dropna(how='all')

# Drop only columns that contain all null values
df_drop_columns_all_null = df.dropna(axis=1, how='all')

# Replace null values with 0
df_replace_null_with_zero = df.fillna(0)

# Forward fill null values
# (fillna(method=...) is deprecated in recent pandas versions; .ffill() and .bfill() give the same result)
df_forward_fill = df.ffill()

# Backward fill null values
df_backward_fill = df.bfill()

# Display the original DataFrame
print("Original DataFrame:")
print(df)

# Display the DataFrame after each operation
print("\nDataFrame after dropping rows containing a null value:")
print(df_drop_rows)

print("\nDataFrame after dropping columns containing a null value:")
print(df_drop_columns)

print("\nDataFrame after dropping only rows that contain all null values:")
print(df_drop_rows_all_null)

print("\nDataFrame after dropping only columns that contain all null values:")
print(df_drop_columns_all_null)

print("\nDataFrame after replacing null values with 0:")
print(df_replace_null_with_zero)

print("\nDataFrame after forward filling null values:")
print(df_forward_fill)

print("\nDataFrame after backward filling null values:")
print(df_backward_fill)


Original DataFrame:
     A    B     C
0  1.0  NaN  10.0
1  2.0  6.0  11.0
2  NaN  7.0  12.0
3  4.0  NaN  13.0
4  5.0  9.0   NaN

DataFrame after dropping rows containing a null value:
     A    B     C
1  2.0  6.0  11.0

DataFrame after dropping columns containing a null value:
Empty DataFrame
Columns: []
Index: [0, 1, 2, 3, 4]

DataFrame after dropping only rows that contain all null values:
     A    B     C
0  1.0  NaN  10.0
1  2.0  6.0  11.0
2  NaN  7.0  12.0
3  4.0  NaN  13.0
4  5.0  9.0   NaN

DataFrame after dropping only columns that contain all null values:
     A    B     C
0  1.0  NaN  10.0
1  2.0  6.0  11.0
2  NaN  7.0  12.0
3  4.0  NaN  13.0
4  5.0  9.0   NaN

DataFrame after replacing null values with 0:
     A    B     C
0  1.0  0.0  10.0
1  2.0  6.0  11.0
2  0.0  7.0  12.0
3  4.0  0.0  13.0
4  5.0  9.0   0.0

DataFrame after forward filling null values:
     A    B     C
0  1.0  NaN  10.0
1  2.0  6.0  11.0
2  2.0  7.0  12.0
3  4.0  7.0  13.0
4  5.0  9.0  13.0

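DataFrame after backward filling null values:
     A    B     C
0  1.0  6.0  10.0
1  2.0  6.0  11.0
2  4.0  7.0  12.0
3  4.0  9.0  13.0
4  5.0  9.0   NaN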
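
In the DataFrame above no row or column is entirely null, so `how='all'` leaves it unchanged. The sketch below uses a hypothetical frame (`df_all`, not part of the assignment data) that does contain an all-null row and an all-null column, and it also shows the `thresh` keyword for keeping rows with a minimum number of non-null values.

```python
import numpy as np
import pandas as pd

# Hypothetical frame with one all-null row (index 1) and one all-null column ('C')
df_all = pd.DataFrame({
    'A': [1.0, np.nan, 3.0],
    'B': [4.0, np.nan, np.nan],
    'C': [np.nan, np.nan, np.nan]
})

rows_kept = df_all.dropna(how='all')           # drops row 1 only
cols_kept = df_all.dropna(axis=1, how='all')   # drops column 'C' only
at_least_two = df_all.dropna(thresh=2)         # keeps rows with at least 2 non-null values

print(rows_kept)
print(cols_kept)
print(at_least_two)
```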

## Requirement 19
Demonstrate Pandas MultiIndex techniques (can use book examples):
* .from_tuples()
* .reindex()
* .unstack()
* .stack()
* indexing and slicing

In [44]:
import pandas as pd

# Create a MultiIndex using .from_tuples()
index_tuples = [('A', 1), ('A', 2), ('B', 1), ('B', 2)]
multi_index = pd.MultiIndex.from_tuples(index_tuples, names=['Letter', 'Number'])

# Create a Series with the MultiIndex
data = [10, 20, 30, 40]
series = pd.Series(data, index=multi_index)

# Display the original Series
print("Original Series:")
print(series)

# Reindex the Series against a new MultiIndex that adds a third level ('Subgroup')
new_index = [('A', 1, 'X'), ('A', 1, 'Y'), ('B', 2, 'X'), ('C', 1, 'Y')]
new_multi_index = pd.MultiIndex.from_tuples(new_index, names=['Letter', 'Number', 'Subgroup'])
reindexed_series = series.reindex(new_multi_index)

# Display the reindexed Series
print("\nReindexed Series:")
print(reindexed_series)

# Unstack the Series to convert it into a DataFrame
unstacked_df = series.unstack()

# Display the unstacked DataFrame
print("\nUnstacked DataFrame:")
print(unstacked_df)

# Stack the DataFrame back into a Series
stacked_series = unstacked_df.stack()

# Display the stacked Series
print("\nStacked Series:")
print(stacked_series)

# Indexing and slicing the Series with MultiIndex
print("\nIndexing and slicing:")
print("Value at index ('A', 2):", series.loc[('A', 2)])
print("Slicing using outer index 'A':")
print(series.loc['A'])


Original Series:
Letter  Number
A       1         10
        2         20
B       1         30
        2         40
dtype: int64

Reindexed Series:
Letter  Number  Subgroup
A       1       X           10.0
                Y           10.0
B       2       X           40.0
C       1       Y            NaN
dtype: float64

Unstacked DataFrame:
Number   1   2
Letter        
A       10  20
B       30  40

Stacked Series:
Letter  Number
A       1         10
        2         20
B       1         30
        2         40
dtype: int64

Indexing and slicing:
Value at index ('A', 2): 20
Slicing using outer index 'A':
Number
1    10
2    20
dtype: int64
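
Indexing can also target the inner level. The following sketch rebuilds the same Series and shows two equivalent ways to select every entry whose `Number` is 1, plus a slice over the outer labels (the variable names are only for this example).

```python
import pandas as pd

index = pd.MultiIndex.from_tuples(
    [('A', 1), ('A', 2), ('B', 1), ('B', 2)], names=['Letter', 'Number'])
series = pd.Series([10, 20, 30, 40], index=index)

# Select on the inner level across all outer labels
inner = series.loc[:, 1]

# The same selection with .xs(), naming the level explicitly
inner_xs = series.xs(1, level='Number')

# Slice a range of outer labels (this requires a sorted index)
outer_slice = series.loc['A':'B']

print(inner)
print(inner_xs)
print(outer_slice)
```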


## Requirement 20
Demonstrate concatenating two Pandas series. One series contains
automobile data. The other series contains motorcycle data.

In [46]:
import pandas as pd

# Automobile data
automobile_data = {
    'Brand': ['Toyota', 'Honda', 'Ford', 'Chevrolet'],
    'Model': ['Camry', 'Civic', 'F-150', 'Silverado'],
    'Year': [2018, 2019, 2020, 2021],
    'Type': ['Sedan', 'Sedan', 'Truck', 'Truck']
}
automobile_series = pd.Series(automobile_data)

# Motorcycle data
motorcycle_data = {
    'Brand': ['Honda', 'Yamaha', 'Kawasaki', 'Ducati'],
    'Model': ['CBR1000RR', 'YZF-R1', 'Ninja ZX-10R', 'Panigale V4'],
    'Year': [2019, 2020, 2021, 2022],
    'Type': ['Sportbike', 'Sportbike', 'Sportbike', 'Sportbike']
}
motorcycle_series = pd.Series(motorcycle_data)

# Concatenate the two series
concatenated_series = pd.concat([automobile_series, motorcycle_series])

# Display the concatenated series
print("Concatenated Series:")
print(concatenated_series)


Concatenated Series:
Brand                  [Toyota, Honda, Ford, Chevrolet]
Model                  [Camry, Civic, F-150, Silverado]
Year                           [2018, 2019, 2020, 2021]
Type                       [Sedan, Sedan, Truck, Truck]
Brand                 [Honda, Yamaha, Kawasaki, Ducati]
Model    [CBR1000RR, YZF-R1, Ninja ZX-10R, Panigale V4]
Year                           [2019, 2020, 2021, 2022]
Type       [Sportbike, Sportbike, Sportbike, Sportbike]
dtype: object
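
The concatenated result above repeats the labels Brand, Model, Year, and Type, because `pd.concat` keeps duplicate index labels. A minimal sketch of one way to keep the two sources apart, using a simplified brand-to-model mapping taken from the data above: the `keys` argument adds an outer index level that records where each entry came from.

```python
import pandas as pd

# Simplified series: brand -> model, taken from the data above
automobiles = pd.Series({'Toyota': 'Camry', 'Honda': 'Civic'})
motorcycles = pd.Series({'Honda': 'CBR1000RR', 'Yamaha': 'YZF-R1'})

# Plain concatenation keeps the duplicate 'Honda' label
combined = pd.concat([automobiles, motorcycles])
print(combined.index.is_unique)

# keys= adds an outer index level that records the source of each entry
labelled = pd.concat([automobiles, motorcycles], keys=['automobile', 'motorcycle'])
print(labelled)
print(labelled.loc[('motorcycle', 'Honda')])
```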


## Requirement 21
Demonstrate concatenating two Pandas dataframe using the .concat()
method.

In [47]:
import pandas as pd

# First DataFrame with automobile data
automobile_data1 = {
    'Brand': ['Toyota', 'Honda', 'Ford', 'Chevrolet'],
    'Model': ['Camry', 'Civic', 'F-150', 'Silverado'],
    'Year': [2018, 2019, 2020, 2021],
    'Type': ['Sedan', 'Sedan', 'Truck', 'Truck']
}
df1 = pd.DataFrame(automobile_data1)

# Second DataFrame with motorcycle data
motorcycle_data2 = {
    'Brand': ['Honda', 'Yamaha', 'Kawasaki', 'Ducati'],
    'Model': ['CBR1000RR', 'YZF-R1', 'Ninja ZX-10R', 'Panigale V4'],
    'Year': [2019, 2020, 2021, 2022],
    'Type': ['Sportbike', 'Sportbike', 'Sportbike', 'Sportbike']
}
df2 = pd.DataFrame(motorcycle_data2)

# Concatenate the two DataFrames
concatenated_df = pd.concat([df1, df2], ignore_index=True)

# Display the concatenated DataFrame
print("Concatenated DataFrame:")
print(concatenated_df)


Concatenated DataFrame:
       Brand         Model  Year       Type
0     Toyota         Camry  2018      Sedan
1      Honda         Civic  2019      Sedan
2       Ford         F-150  2020      Truck
3  Chevrolet     Silverado  2021      Truck
4      Honda     CBR1000RR  2019  Sportbike
5     Yamaha        YZF-R1  2020  Sportbike
6   Kawasaki  Ninja ZX-10R  2021  Sportbike
7     Ducati   Panigale V4  2022  Sportbike
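
The two DataFrames above share exactly the same columns. When they do not, the `join` argument controls which columns survive. In this sketch the `Doors` column and its values are made up purely to create a mismatch.

```python
import pandas as pd

cars = pd.DataFrame({'Brand': ['Toyota', 'Ford'],
                     'Model': ['Camry', 'F-150'],
                     'Doors': [4, 2]})   # 'Doors' is a made-up extra column
bikes = pd.DataFrame({'Brand': ['Yamaha', 'Ducati'],
                      'Model': ['YZF-R1', 'Panigale V4']})

# Default join='outer' keeps every column and fills the gaps with NaN
outer = pd.concat([cars, bikes], ignore_index=True)

# join='inner' keeps only the columns shared by both frames
inner = pd.concat([cars, bikes], ignore_index=True, join='inner')

print(outer)
print(inner)
```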


## Requirement 22
Using Pandas dataframes, demonstrate the following joins:
* one-to-one
* many-to-one
* many-to-many

In [49]:
import pandas as pd

# One-to-one join
# Create the left DataFrame
left_df = pd.DataFrame({
    'employee_id': [1, 2, 3, 4],
    'name': ['Alice', 'Bob', 'Charlie', 'David']
})
# Create the right DataFrame
right_df = pd.DataFrame({
    'employee_id': [1, 2, 3, 4],
    'department': ['HR', 'Engineering', 'Marketing', 'Finance']
})
# Perform the one-to-one join
one_to_one_join = pd.merge(left_df, right_df, on='employee_id')

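# Many-to-one join
# (in this sample every manager matches exactly one name, so the result is
# effectively one-to-one; a many-to-one join needs repeated keys in one frame)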
# Create the left DataFrame
left_df = pd.DataFrame({
    'department': ['HR', 'Engineering', 'Marketing', 'Finance'],
    'manager': ['Alice', 'Bob', 'Charlie', 'David']
})
# Create the right DataFrame
right_df = pd.DataFrame({
    'employee_id': [1, 2, 3, 4, 5],
    'name': ['Alice', 'Bob', 'Charlie', 'David', 'Eve']
})
# Perform the many-to-one join
many_to_one_join = pd.merge(left_df, right_df, left_on='manager', right_on='name')

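# Many-to-many join
# (employee_id is unique in the left frame here, so this is technically
# one-to-many; a many-to-many join needs repeated keys in both frames)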
# Create the left DataFrame
left_df = pd.DataFrame({
    'employee_id': [1, 2, 3, 4],
    'task': ['Task A', 'Task B', 'Task C', 'Task D']
})
# Create the right DataFrame
right_df = pd.DataFrame({
    'employee_id': [1, 2, 2, 3, 3, 4],
    'project': ['Project X', 'Project Y', 'Project Z', 'Project X', 'Project Y', 'Project Z']
})
# Perform the many-to-many join
many_to_many_join = pd.merge(left_df, right_df, on='employee_id')

# Display the results
print("One-to-one join:")
print(one_to_one_join)
print("\nMany-to-one join:")
print(many_to_one_join)
print("\nMany-to-many join:")
print(many_to_many_join)


One-to-one join:
   employee_id     name   department
0            1    Alice           HR
1            2      Bob  Engineering
2            3  Charlie    Marketing
3            4    David      Finance

Many-to-one join:
    department  manager  employee_id     name
0           HR    Alice            1    Alice
1  Engineering      Bob            2      Bob
2    Marketing  Charlie            3  Charlie
3      Finance    David            4    David

Many-to-many join:
   employee_id    task    project
0            1  Task A  Project X
1            2  Task B  Project Y
2            2  Task B  Project Z
3            3  Task C  Project X
4            3  Task C  Project Y
5            4  Task D  Project Z


## Requirement 23

Using Pandas dataframes, demonstrate merge with the following
keywords:
* on
* left_on and right_on
* left_index and right_index

In [50]:
import pandas as pd

# Sample data
left_data = {
    'employee_id': [1, 2, 3],
    'name': ['Alice', 'Bob', 'Charlie'],
    'department': ['HR', 'Engineering', 'Marketing']
}

right_data = {
    'employee_id': [1, 2, 3],
    'salary': [50000, 60000, 70000]
}

# Create DataFrames
left_df = pd.DataFrame(left_data)
right_df = pd.DataFrame(right_data)

# Merge using 'on'
on_merge = pd.merge(left_df, right_df, on='employee_id')

# Merge using 'left_on' and 'right_on'
left_on_right_on_merge = pd.merge(left_df, right_df, left_on='employee_id', right_on='employee_id')

# Set index for DataFrames
left_df.set_index('employee_id', inplace=True)
right_df.set_index('employee_id', inplace=True)

# Merge using 'left_index' and 'right_index'
index_merge = pd.merge(left_df, right_df, left_index=True, right_index=True)

# Display the results
print("Merge using 'on':")
print(on_merge)

print("\nMerge using 'left_on' and 'right_on':")
print(left_on_right_on_merge)

print("\nMerge using 'left_index' and 'right_index':")
print(index_merge)


Merge using 'on':
   employee_id     name   department  salary
0            1    Alice           HR   50000
1            2      Bob  Engineering   60000
2            3  Charlie    Marketing   70000

Merge using 'left_on' and 'right_on':
   employee_id     name   department  salary
0            1    Alice           HR   50000
1            2      Bob  Engineering   60000
2            3  Charlie    Marketing   70000

Merge using 'left_index' and 'right_index':
                name   department  salary
employee_id                              
1              Alice           HR   50000
2                Bob  Engineering   60000
3            Charlie    Marketing   70000
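
`left_on` and `right_on` are most useful when the key column has a different name in each frame. A minimal sketch, assuming a hypothetical second table that calls the key `employee` instead of `name` (the salaries are the ones used above):

```python
import pandas as pd

staff = pd.DataFrame({'name': ['Alice', 'Bob', 'Charlie'],
                      'department': ['HR', 'Engineering', 'Marketing']})
# Hypothetical second table that calls the key column 'employee' instead of 'name'
pay = pd.DataFrame({'employee': ['Bob', 'Alice', 'Charlie'],
                    'salary': [60000, 50000, 70000]})

# left_on/right_on name the key column in each frame
merged = pd.merge(staff, pay, left_on='name', right_on='employee')

# Both key columns are kept after the merge, so drop the redundant one
merged = merged.drop('employee', axis=1)
print(merged)
```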


## Requirement 24
Use markdown to include a statement at the end of assignment-11.ipynb
explaining your experiences with Assignment 11. Make this authentic
(minimum of 2-3 sentences).


Working on Assignment 11 was an insightful experience. Exploring various Pandas operations and techniques helped me deepen my understanding of data manipulation in Python. I particularly enjoyed experimenting with different join types and merging strategies, which broadened my knowledge of relational database concepts in the context of data analysis.
