## Introduction to Data Manipulation with Pandas

Welcome to Module 1.5! In the preceding sections, you've gained a strong foundation in Python programming, delved into Object-Oriented Programming (OOP), and explored data and array manipulation using NumPy. Now, we're stepping into the world of data analysis with Pandas.

**Pandas** is an essential library in the toolkit of every data scientist. It provides high-level data structures and powerful tools for data manipulation and analysis in Python. Built on top of NumPy, Pandas is designed to make data analysis fast and straightforward. It's a key component of the data science ecosystem and seamlessly integrates with NumPy-centric applications.

### Why Pandas?

You might be wondering, "Why do we need Pandas when we already have NumPy?" While NumPy excels at numerical and array-based operations, Pandas is tailored for working with structured data. It shines when dealing with data in tables, spreadsheets, or datasets with labeled columns and rows. With Pandas, you can efficiently load, clean, explore, transform, and analyze data.

### Key Concepts in Pandas

As we delve into Pandas, here are some key concepts and functionalities we'll explore:

- **DataFrame:** Understand the core data structure for tabular data in Pandas.
- **Series:** Explore the one-dimensional labeled array, a fundamental building block.
- **Data Loading:** Learn how to read data from various sources, including CSV, Excel, and SQL databases.
- **Data Cleaning and Preparation:** Handle missing data, duplicate records, and other data quality issues.
- **Data Exploration:** Use Pandas' powerful features for data summarization and exploration.
- **Indexing and Selection:** Access and filter data efficiently.
- **Data Transformation:** Perform operations like merging, reshaping, and aggregating data.

### Getting Started

Before we dive in, make sure Pandas is installed. If it is not, you can typically install it from a notebook cell (the leading `!` runs the command in the shell; omit it when typing in a terminal):


`!pip install pandas`

### Pandas Data Structure
To get started with pandas, you will need to get comfortable with its two workhorse
data structures: __Series and DataFrame__.

#### Pandas Series
A Series is a one-dimensional array-like object containing an array of data (of any
NumPy data type) and an associated array of data labels, called its index.

In [2]:
import pandas as pd

In [3]:
series_data = pd.Series([2,4,6,7,9])
series_data

0    2
1    4
2    6
3    7
4    9
dtype: int64
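
The index does not have to be the default 0, 1, 2, ... integers shown above. As a minimal sketch (the labels and values below are made up for illustration), a Series can be given its own labels and then accessed through them:

```python
import pandas as pd

# Hypothetical temperatures labelled by day (illustrative values only)
temps = pd.Series([25, 22, 23], index=["Mon", "Tue", "Wed"])

print(temps["Tue"])    # look up a value by its label
print(temps.index)     # the labels themselves
print(temps.values)    # the underlying NumPy array
```

The same idea carries over to a DataFrame, where each column is a Series sharing one common index.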

#### DataFrame
A DataFrame represents a tabular, spreadsheet-like data structure containing an ordered collection of columns, each of which can be a different value type (numeric,
string, boolean, etc.). A DataFrame is a 2-dimensional data structure that can store data of different types (including characters, integers, floating point values, categorical data and more) in columns. It is similar to a spreadsheet, a SQL table or the data frame in R.

#### Creating DataFrame 
We are going to look at how we can create a pandas DataFrame in a number of ways:
1. Using a Python dictionary
2. Using a list of dictionaries
3. Using a list of tuples
4. Loading from CSV
5. Loading from Excel

##### Using Python Dictionary

In [4]:
# First, let's create a dictionary
forecast_dict = {
    "day":["Monday", "Tuesday", "Wednesday", "Thursday", "Friday", "Saturday", "Sunday"],
    "temperature": [25, 22, 23, 24, 25, 27, 22],
    "wind_speed": [6, 5, 2, 3, 3, 4, 5],
    "humidity": [22, 21, 23, 22, 23, 24, 22],
    "event": ["Sunny", "Windy", "Rainy","Rainy", "Windy", "Sunny", "Rainy"]
}

# Now, we'll create our DataFrame by passing the dictionary to the Pandas DataFrame constructor
forecast_data = pd.DataFrame(forecast_dict)

In [5]:
forecast_data

Unnamed: 0,day,temperature,wind_speed,humidity,event
0,Monday,25,6,22,Sunny
1,Tuesday,22,5,21,Windy
2,Wednesday,23,2,23,Rainy
3,Thursday,24,3,22,Rainy
4,Friday,25,3,23,Windy
5,Saturday,27,4,24,Sunny
6,Sunday,22,5,22,Rainy


##### Using List of Dictionary

In [6]:
# First, let's create a list of dictionaries (one dictionary per row)
forecast_dict_list = [
    {"day":"Monday", "temperature":25, "wind_speed":6, "humidity":22, "event":"Sunny"},
    {"day":"Tuesday", "temperature":22, "wind_speed":5, "humidity":21, "event":"Windy"},
    {"day":"Wednesday", "temperature":23, "wind_speed":2, "humidity":23, "event":"Rainy"},
    {"day":"Thursday", "temperature":24, "wind_speed":3, "humidity":22, "event":"Rainy"},
    {"day":"Friday", "temperature":25, "wind_speed":3, "humidity":23, "event":"Windy"},
    {"day":"Saturday", "temperature":27, "wind_speed":4, "humidity":23, "event":"Windy"},
    {"day":"Sunday", "temperature":22, "wind_speed":5, "humidity":22, "event":"Rainy"},
]
forecast_data = pd.DataFrame(forecast_dict_list)
forecast_data

Unnamed: 0,day,temperature,wind_speed,humidity,event
0,Monday,25,6,22,Sunny
1,Tuesday,22,5,21,Windy
2,Wednesday,23,2,23,Rainy
3,Thursday,24,3,22,Rainy
4,Friday,25,3,23,Windy
5,Saturday,27,4,23,Windy
6,Sunday,22,5,22,Rainy


##### Using List of Tuples

In [7]:
# First, we have to create a list of tuples (one tuple per row)
forecast_tuple_list = [
    ("Monday",25, 6, 22, "Sunny"),
    ("Tuesday", 22, 5, 21, "Windy"),
    ("Wednesday", 23, 2, 23, "Rainy"),
    ("Thursday", 24, 3, 22, "Rainy"),
    ("Friday", 25, 3, 23, "Windy"),
    ("Saturday", 27, 4, 23, "Windy"),
    ("Sunday", 22, 5, 22, "Rainy"),
]
# We then pass the list to the Pandas DataFrame constructor and name the columns explicitly
forecast_data = pd.DataFrame(forecast_tuple_list, columns=["day", "temperature", "wind_speed", "humidity", "event"])
forecast_data

Unnamed: 0,day,temperature,wind_speed,humidity,event
0,Monday,25,6,22,Sunny
1,Tuesday,22,5,21,Windy
2,Wednesday,23,2,23,Rainy
3,Thursday,24,3,22,Rainy
4,Friday,25,3,23,Windy
5,Saturday,27,4,23,Windy
6,Sunday,22,5,22,Rainy


##### Loading from CSV
To work with comma-separated values (CSV) data, pass the file path to the `pd.read_csv()` function, which reads the file into a pandas DataFrame. pandas supports many file formats and data sources out of the box (CSV, Excel, SQL, JSON, Parquet, ...), each through a function with the prefix `read_*`.

In [9]:
df = pd.read_csv("eda.csv")
df

Unnamed: 0,Date,ProductID,SalesRepID,Units
0,1/8/24,2,2,269
1,1/11/24,2,4,77
2,1/16/24,2,2,247
3,1/18/24,4,2,221
4,1/19/24,2,4,63
...,...,...,...,...
247,12/29/24,2,2,2
248,12/30/24,2,4,111
249,12/31/24,2,2,97
250,12/31/24,2,1,1
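
To try `read_csv()` without a file on disk, the sketch below reads from an in-memory string instead. The three rows are copied from the output above; the assumption that the dates are in month/day/year order is ours and is not stated by the data itself.

```python
import io
import pandas as pd

# A tiny in-memory stand-in for a CSV file (first rows of the output above)
csv_text = """Date,ProductID,SalesRepID,Units
1/8/24,2,2,269
1/11/24,2,4,77
1/16/24,2,2,247
"""

sales = pd.read_csv(io.StringIO(csv_text))

# Dates are read as plain strings by default; convert them explicitly.
# Assumption: the format is month/day/two-digit year.
sales["Date"] = pd.to_datetime(sales["Date"], format="%m/%d/%y")

print(sales.dtypes)
```

Converting the column up front makes later date-based filtering and sorting behave as expected, rather than comparing strings.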


##### Loading from Excel
To work with an Excel spreadsheet, pass the file path to the `pd.read_excel()` function. Reading `.xlsx` files requires an Excel engine such as `openpyxl` to be installed.

In [10]:
df_excel = pd.read_excel("summary.xlsx")
df_excel

Unnamed: 0,Temperature,Rainfall,Flyers,Price,Sales
0,365.0,365.0,365.0,365.0,365.0
1,60.731233,0.826603,40.284932,0.333973,25.323288
2,16.196266,0.273171,13.178651,0.075206,6.893589
3,15.1,0.47,9.0,0.3,7.0
4,49.7,0.65,31.0,0.3,20.0
5,61.1,0.74,39.0,0.3,25.0
6,71.3,0.91,49.0,0.3,30.0
7,102.9,2.5,80.0,0.5,43.0


### DataFrame Basics

We'll cover the following here:
1. Getting a technical summary
2. Checking the first/last/random N rows
3. Selecting specific column(s)
4. Filtering specific rows
5. Filtering specific rows and columns
6. Creating a new column based on existing columns
7. Calculating summary statistics
8. Sorting
9. Manipulating text data
10. Checking for missing values
11. Extracting information about the values of a column
12. Combining multiple tables

In [15]:
# get the technical summary of our dataset
df_excel.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 8 entries, 0 to 7
Data columns (total 5 columns):
 #   Column       Non-Null Count  Dtype  
---  ------       --------------  -----  
 0   Temperature  8 non-null      float64
 1   Rainfall     8 non-null      float64
 2   Flyers       8 non-null      float64
 3   Price        8 non-null      float64
 4   Sales        8 non-null      float64
dtypes: float64(5)
memory usage: 448.0 bytes


In [12]:
# Let's see the first 3 rows of the DataFrame (head() returns the first 5 by default)
df_excel.head(3)

Unnamed: 0,Temperature,Rainfall,Flyers,Price,Sales
0,365.0,365.0,365.0,365.0,365.0
1,60.731233,0.826603,40.284932,0.333973,25.323288
2,16.196266,0.273171,13.178651,0.075206,6.893589


In [13]:
# Let's see the last 4 rows
df_excel.tail(4)

Unnamed: 0,Temperature,Rainfall,Flyers,Price,Sales
4,49.7,0.65,31.0,0.3,20.0
5,61.1,0.74,39.0,0.3,25.0
6,71.3,0.91,49.0,0.3,30.0
7,102.9,2.5,80.0,0.5,43.0


In [14]:
# Let's see 3 randomly chosen rows (the selection changes on each run)
df_excel.sample(3)

Unnamed: 0,Temperature,Rainfall,Flyers,Price,Sales
3,15.1,0.47,9.0,0.3,7.0
5,61.1,0.74,39.0,0.3,25.0
0,365.0,365.0,365.0,365.0,365.0


In [15]:
# Let's select a specific column of our DataFrame
# To select a single column, use square brackets [] with the column name of the column of interest
df_excel["Temperature"] # or df_excel.Temperature

0    365.000000
1     60.731233
2     16.196266
3     15.100000
4     49.700000
5     61.100000
6     71.300000
7    102.900000
Name: Temperature, dtype: float64

In [16]:
df_excel.Temperature

0    365.000000
1     60.731233
2     16.196266
3     15.100000
4     49.700000
5     61.100000
6     71.300000
7    102.900000
Name: Temperature, dtype: float64

In [17]:
# We can check the type and the shape of a single column
type(df_excel.Temperature)

pandas.core.series.Series

In [18]:
df_excel.Temperature.shape

(8,)

`shape` is an attribute of both a pandas DataFrame and a pandas Series. For a DataFrame it contains the number of rows and columns: (nrows, ncolumns). A pandas Series is 1-dimensional, so only the number of rows is returned. <br>
__Note: We don't use parentheses with shape because it is an attribute, not a method.__
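
As a quick sketch of the difference (using a small made-up frame rather than `df_excel`), compare the shape of a whole frame, a single column, and a one-column selection made with a list:

```python
import pandas as pd

# Small illustrative frame (values are made up)
weather = pd.DataFrame({
    "Temperature": [15.1, 49.7, 61.1],
    "Rainfall": [0.47, 0.65, 0.74],
})

print(weather.shape)                    # (rows, columns)
print(weather["Temperature"].shape)     # single column: a Series with one dimension
print(weather[["Temperature"]].shape)   # list with one name: still a DataFrame
print(len(weather))                     # number of rows only
```

Single brackets with one name give a Series, while double brackets (a list of names) always give a DataFrame, even for one column.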

In [29]:
# To select multiple columns, use a list of column names within the selection brackets []
df_excel[["Temperature", "Rainfall"]]

Unnamed: 0,Temperature,Rainfall
0,365.0,365.0
1,60.731233,0.826603
2,16.196266,0.273171
3,15.1,0.47
4,49.7,0.65
5,61.1,0.74
6,71.3,0.91
7,102.9,2.5


In [31]:
df_excel[["Temperature", "Rainfall"]].shape
#temp_and_rain = df_excel[["Temperature", "Rainfall"]]
#print(temp_and_rain.shape)

(8, 2)

In [19]:
temp_and_rain = df_excel[["Temperature", "Rainfall"]]
temp_and_rain.head()
#print(temp_and_rain.shape)

Unnamed: 0,Temperature,Rainfall
0,365.0,365.0
1,60.731233,0.826603
2,16.196266,0.273171
3,15.1,0.47
4,49.7,0.65


The selection returned a DataFrame with 8 rows and 2 columns. Remember, a DataFrame is 2-dimensional with
both a row and a column dimension.

##### Filtering Specific rows

In [20]:
# Let's say we want to get the rows (i.e. data points) where the Temperature is greater than 50
# The condition on its own returns a boolean Series; placing it inside the selection brackets [] keeps only the rows where it is True
df_excel["Temperature"] > 50

0     True
1     True
2    False
3    False
4    False
5     True
6     True
7     True
Name: Temperature, dtype: bool

In [22]:
df_excel[df_excel["Temperature"] > 50]

Unnamed: 0,Temperature,Rainfall,Flyers,Price,Sales
0,365.0,365.0,365.0,365.0,365.0
1,60.731233,0.826603,40.284932,0.333973,25.323288
5,61.1,0.74,39.0,0.3,25.0
6,71.3,0.91,49.0,0.3,30.0
7,102.9,2.5,80.0,0.5,43.0


In [23]:
df_excel[df_excel["Rainfall"] < 0.5]

Unnamed: 0,Temperature,Rainfall,Flyers,Price,Sales
2,16.196266,0.273171,13.178651,0.075206,6.893589
3,15.1,0.47,9.0,0.3,7.0


In [34]:
# Let's get the rows for which Sales is either 25 or 30

# (df_excel["Sales"] == 25) | (df_excel["Sales"] == 30)
df_excel[(df_excel["Sales"] == 25) | (df_excel["Sales"] == 30)]

Unnamed: 0,Temperature,Rainfall,Flyers,Price,Sales
5,61.1,0.74,39.0,0.3,25.0
6,71.3,0.91,49.0,0.3,30.0


In [24]:
# Another way to do this is to use the isin() method
df_excel[df_excel["Sales"].isin([25, 30])]

Unnamed: 0,Temperature,Rainfall,Flyers,Price,Sales
5,61.1,0.74,39.0,0.3,25.0
6,71.3,0.91,49.0,0.3,30.0
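
Conditions can also be combined with `&` (and) and negated with `~` (not), in the same way `|` (or) was used above. The sketch below uses a small frame with a few values copied from the table above; the variable names are ours.

```python
import pandas as pd

weather = pd.DataFrame({
    "Temperature": [15.1, 49.7, 61.1, 71.3, 102.9],
    "Rainfall": [0.47, 0.65, 0.74, 0.91, 2.5],
    "Sales": [7.0, 20.0, 25.0, 30.0, 43.0],
})

# AND: both conditions must hold (each condition needs its own parentheses)
warm_and_dry = weather[(weather["Temperature"] > 50) & (weather["Rainfall"] < 1)]

# NOT: rows whose Sales value is not in the list
not_25_or_30 = weather[~weather["Sales"].isin([25, 30])]

print(warm_and_dry)
print(not_25_or_30)
```

The parentheses matter because `&`, `|` and `~` bind more tightly than comparison operators such as `>` in Python.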


##### Filtering Specific rows and columns

In [36]:
# Let's say we are interested in the rows at positions 2 through 6 and the first three columns (positions 0 through 2)
df_excel.iloc[2:7, 0:3]

Unnamed: 0,Temperature,Rainfall,Flyers
2,16.196266,0.273171,13.178651
3,15.1,0.47,9.0
4,49.7,0.65,31.0
5,61.1,0.74,39.0
6,71.3,0.91,49.0


In [38]:
# Grab all the rows and the first three columns
df_excel.iloc[:, 0:3]

Unnamed: 0,Temperature,Rainfall,Flyers
0,365.0,365.0,365.0
1,60.731233,0.826603,40.284932
2,16.196266,0.273171,13.178651
3,15.1,0.47,9.0
4,49.7,0.65,31.0
5,61.1,0.74,39.0
6,71.3,0.91,49.0
7,102.9,2.5,80.0


__Note: When specifically interested in certain rows and/or columns based on their position in the table, use the iloc
operator in front of the selection brackets []. As with Python slices, the end position is excluded.__

In [26]:
# Let's get the Rainfall values at specific index labels (say 3 to 5); unlike iloc, loc includes the end label
df_excel.loc[3:5,"Rainfall"]

#df_excel.loc[3:5,["Temperature", "Rainfall"]]

3    0.47
4    0.65
5    0.74
Name: Rainfall, dtype: float64
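
With the default 0, 1, 2, ... index, labels and positions happen to coincide, which can hide the difference between `loc` and `iloc`. A minimal sketch with a deliberately different index (the labels 10 to 40 are made up) shows it:

```python
import pandas as pd

# Index labels deliberately differ from row positions
rain = pd.DataFrame(
    {"Rainfall": [0.47, 0.65, 0.74, 0.91]},
    index=[10, 20, 30, 40],
)

by_label = rain.loc[10:30, "Rainfall"]   # labels 10, 20 and 30 (end label included)
by_position = rain.iloc[0:2, 0]          # positions 0 and 1 (end position excluded)

print(by_label)
print(by_position)
```

In short, `loc` works with labels and `iloc` works with integer positions, regardless of what the labels look like.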

##### Creating a new column based on existing columns

In [45]:
df_excel

Unnamed: 0,Temperature,Rainfall,Flyers,Price,Sales
0,365.0,365.0,365.0,365.0,365.0
1,60.731233,0.826603,40.284932,0.333973,25.323288
2,16.196266,0.273171,13.178651,0.075206,6.893589
3,15.1,0.47,9.0,0.3,7.0
4,49.7,0.65,31.0,0.3,20.0
5,61.1,0.74,39.0,0.3,25.0
6,71.3,0.91,49.0,0.3,30.0
7,102.9,2.5,80.0,0.5,43.0


In [28]:
# Let's create a new column called revenue from our existing data
# Note: revenue = price * sales
df_excel["Revenue"] = df_excel["Price"] * df_excel["Sales"]
df_excel

Unnamed: 0,Temperature,Rainfall,Flyers,Price,Sales,Revenue
0,365.0,365.0,365.0,365.0,365.0,133225.0
1,60.731233,0.826603,40.284932,0.333973,25.323288,8.457284
2,16.196266,0.273171,13.178651,0.075206,6.893589,0.518436
3,15.1,0.47,9.0,0.3,7.0,2.1
4,49.7,0.65,31.0,0.3,20.0,6.0
5,61.1,0.74,39.0,0.3,25.0,7.5
6,71.3,0.91,49.0,0.3,30.0,9.0
7,102.9,2.5,80.0,0.5,43.0,21.5


__Note: To create a new column, use the [] brackets with the new column name at the left side of the assignment.__
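
Assigning with `[]` modifies the DataFrame in place. If you would rather leave the original untouched, `assign()` returns a new DataFrame with the extra column. A small sketch with made-up rows:

```python
import pandas as pd

sales = pd.DataFrame({"Price": [0.3, 0.3, 0.5], "Sales": [20.0, 25.0, 43.0]})

# assign() returns a copy with the new column; `sales` itself is not changed
with_revenue = sales.assign(Revenue=sales["Price"] * sales["Sales"])

print(with_revenue)
print("Revenue" in sales.columns)   # checks whether the original gained the column
```

Which style to use is mostly a matter of preference; `assign()` is convenient when chaining several steps together.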

#### Calculating Summary Statistics for our Data

In [23]:
# The aggregating statistic can be calculated for multiple columns at the same time
df_excel.describe()

Unnamed: 0,Temperature,Rainfall,Flyers,Price,Sales,Revenue
count,8.0,8.0,8.0,8.0,8.0,8.0
mean,92.753437,46.421222,78.307948,45.888647,65.27711,16660.009465
std,113.674605,128.727058,117.915512,128.940509,121.68342,47099.36914
min,15.1,0.273171,9.0,0.075206,6.893589,0.518436
25%,41.324066,0.605,26.544663,0.3,16.75,5.025
50%,60.915616,0.783301,39.642466,0.3,25.161644,7.978642
75%,79.2,1.3075,56.75,0.375479,33.25,12.125
max,365.0,365.0,365.0,365.0,365.0,133225.0


In [47]:
# Let's say we want to get the mean temperature
mean_temp = df_excel["Temperature"].mean()
print(f"The mean Temperature for this location is {mean_temp}")

The mean Temperature for this location is 92.75343731901152


In [48]:
# Let's get the average rainfall and Temperature
df_excel[["Temperature", "Rainfall"]].mean()

Temperature    92.753437
Rainfall       46.421222
dtype: float64
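
When different columns need different statistics, `agg()` accepts a dictionary mapping column names to the functions to apply. The frame below is a small made-up example, not `df_excel`:

```python
import pandas as pd

weather = pd.DataFrame({
    "Temperature": [15.1, 49.7, 61.1],
    "Rainfall": [0.47, 0.65, 0.74],
})

# One row per statistic; combinations that were not requested show up as NaN
summary = weather.agg({"Temperature": ["min", "max"], "Rainfall": ["mean"]})

print(summary)
```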

In [30]:
# Let's get the row for which the revenue was smallest
df_excel[df_excel["Revenue"] == df_excel["Revenue"].min()]

Unnamed: 0,Temperature,Rainfall,Flyers,Price,Sales,Revenue
2,16.196266,0.273171,13.178651,0.075206,6.893589,0.518436


In [32]:
min_rev = df_excel[df_excel["Revenue"] == df_excel["Revenue"].min()]


min_rev["Rainfall"]

2    0.273171
Name: Rainfall, dtype: float64

In [50]:
# Let's say we are only interested in a specific column from the result, e.g. the rainfall for which the revenue was smallest
df_excel["Rainfall"][df_excel["Revenue"] == df_excel["Revenue"].min()]

2    0.273171
Name: Rainfall, dtype: float64
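
An alternative sketch for the same lookup uses `idxmin()`, which returns the index label of the smallest value, together with `loc`. The rows are made up for illustration; note that `idxmin()` returns only the first label when several rows tie for the minimum, whereas the boolean filter above keeps all of them.

```python
import pandas as pd

weather = pd.DataFrame(
    {"Rainfall": [0.47, 0.65, 0.74], "Revenue": [2.1, 6.0, 7.5]},
    index=[3, 4, 5],
)

lowest_label = weather["Revenue"].idxmin()               # label of the smallest revenue
rain_at_lowest = weather.loc[lowest_label, "Rainfall"]   # a single value, not a Series

print(lowest_label, rain_at_lowest)
```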

#### Sorting the data

In [52]:
# Let's sort our data using the revenue column and get the first 5 rows
df_excel.sort_values(by="Revenue").head(5)

Unnamed: 0,Temperature,Rainfall,Flyers,Price,Sales,Revenue
2,16.196266,0.273171,13.178651,0.075206,6.893589,0.518436
3,15.1,0.47,9.0,0.3,7.0,2.1
4,49.7,0.65,31.0,0.3,20.0,6.0
5,61.1,0.74,39.0,0.3,25.0,7.5
1,60.731233,0.826603,40.284932,0.333973,25.323288,8.457284


In [53]:
# Let's sort in descending order
df_excel.sort_values(by="Revenue", ascending=False).head(5)

Unnamed: 0,Temperature,Rainfall,Flyers,Price,Sales,Revenue
0,365.0,365.0,365.0,365.0,365.0,133225.0
7,102.9,2.5,80.0,0.5,43.0,21.5
6,71.3,0.91,49.0,0.3,30.0,9.0
1,60.731233,0.826603,40.284932,0.333973,25.323288,8.457284
5,61.1,0.74,39.0,0.3,25.0,7.5


In [54]:
# We can also sort using a list of columns (later columns break ties in the earlier ones)
df_excel.sort_values(by=["Revenue", "Rainfall"], ascending=False).head(5)

Unnamed: 0,Temperature,Rainfall,Flyers,Price,Sales,Revenue
0,365.0,365.0,365.0,365.0,365.0,133225.0
7,102.9,2.5,80.0,0.5,43.0,21.5
6,71.3,0.91,49.0,0.3,30.0,9.0
1,60.731233,0.826603,40.284932,0.333973,25.323288,8.457284
5,61.1,0.74,39.0,0.3,25.0,7.5
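
The sort direction can also be set per column by passing a list to `ascending`. A minimal sketch with made-up rows, sorting by price from low to high and, within the same price, by sales from high to low:

```python
import pandas as pd

orders = pd.DataFrame({
    "Price": [0.3, 0.5, 0.3, 0.3],
    "Sales": [20.0, 43.0, 30.0, 25.0],
})

# One True/False entry per column named in `by`
ordered = orders.sort_values(by=["Price", "Sales"], ascending=[True, False])

print(ordered)
```

`sort_values()` returns a new DataFrame and keeps the original index labels, which is why the index in the output is no longer in order.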


#### Manipulating Textual Data
This section uses the Titanic data set, stored as CSV. The data consists of the following columns:
* PassengerId: Id of every passenger.
* Survived: Takes the value 0 or 1: 0 for did not survive and 1 for survived.
* Pclass: Ticket class. There are 3 classes: Class 1, Class 2 and Class 3.
* Name: Name of the passenger.
* Sex: Gender of the passenger.
* Age: Age of the passenger.
* SibSp: Number of siblings and spouses the passenger had aboard.
* Parch: Number of parents and children the passenger had aboard.
* Ticket: Ticket number of the passenger.
* Fare: Fare paid by the passenger.
* Cabin: Cabin of the passenger.
* Embarked: Port of embarkation (C = Cherbourg, Q = Queenstown, S = Southampton).

In [36]:
titanic_df = pd.read_csv("titanic.csv")
titanic_df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 891 entries, 0 to 890
Data columns (total 12 columns):
 #   Column       Non-Null Count  Dtype  
---  ------       --------------  -----  
 0   PassengerId  891 non-null    int64  
 1   Survived     891 non-null    int64  
 2   Pclass       891 non-null    int64  
 3   Name         891 non-null    object 
 4   Sex          891 non-null    object 
 5   Age          714 non-null    float64
 6   SibSp        891 non-null    int64  
 7   Parch        891 non-null    int64  
 8   Ticket       891 non-null    object 
 9   Fare         891 non-null    float64
 10  Cabin        204 non-null    object 
 11  Embarked     889 non-null    object 
dtypes: float64(2), int64(5), object(5)
memory usage: 83.7+ KB


In [37]:
titanic_df.head()

Unnamed: 0,PassengerId,Survived,Pclass,Name,Sex,Age,SibSp,Parch,Ticket,Fare,Cabin,Embarked
0,1,0,3,"Braund, Mr. Owen Harris",male,22.0,1,0,A/5 21171,7.25,,S
1,2,1,1,"Cumings, Mrs. John Bradley (Florence Briggs Th...",female,38.0,1,0,PC 17599,71.2833,C85,C
2,3,1,3,"Heikkinen, Miss. Laina",female,26.0,0,0,STON/O2. 3101282,7.925,,S
3,4,1,1,"Futrelle, Mrs. Jacques Heath (Lily May Peel)",female,35.0,1,0,113803,53.1,C123,S
4,5,0,3,"Allen, Mr. William Henry",male,35.0,0,0,373450,8.05,,S


In [38]:
# Let's make all the characters in the Name column lowercase
titanic_df["Name"].str.lower()

0                                braund, mr. owen harris
1      cumings, mrs. john bradley (florence briggs th...
2                                 heikkinen, miss. laina
3           futrelle, mrs. jacques heath (lily may peel)
4                               allen, mr. william henry
                             ...                        
886                                montvila, rev. juozas
887                         graham, miss. margaret edith
888             johnston, miss. catherine helen "carrie"
889                                behr, mr. karl howell
890                                  dooley, mr. patrick
Name: Name, Length: 891, dtype: object

In [39]:
name  = "Braund, Mr. Owen Harris"
name.split(",")[0]

'Braund'

In [40]:
# Let's create a column named surname by extracting the first part of the Name column
titanic_df["Surname"] = titanic_df["Name"].str.split(",").str.get(0)
titanic_df.head(5)

Unnamed: 0,PassengerId,Survived,Pclass,Name,Sex,Age,SibSp,Parch,Ticket,Fare,Cabin,Embarked,Surname
0,1,0,3,"Braund, Mr. Owen Harris",male,22.0,1,0,A/5 21171,7.25,,S,Braund
1,2,1,1,"Cumings, Mrs. John Bradley (Florence Briggs Th...",female,38.0,1,0,PC 17599,71.2833,C85,C,Cumings
2,3,1,3,"Heikkinen, Miss. Laina",female,26.0,0,0,STON/O2. 3101282,7.925,,S,Heikkinen
3,4,1,1,"Futrelle, Mrs. Jacques Heath (Lily May Peel)",female,35.0,1,0,113803,53.1,C123,S,Futrelle
4,5,0,3,"Allen, Mr. William Henry",male,35.0,0,0,373450,8.05,,S,Allen


In [63]:
# Let's see which passengers have "Owen" in their name
titanic_df[titanic_df["Name"].str.contains("Owen")]

Unnamed: 0,PassengerId,Survived,Pclass,Name,Sex,Age,SibSp,Parch,Ticket,Fare,Cabin,Embarked,Surname
0,1,0,3,"Braund, Mr. Owen Harris",male,22.0,1,0,A/5 21171,7.25,,S,Braund
834,835,0,3,"Allum, Mr. Owen George",male,18.0,0,0,2223,8.3,,S,Allum
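
To get a count rather than the matching rows, the boolean mask can be summed, since `True` counts as 1. The sketch below also shows two optional parameters of `str.contains()`: `case=False` for a case-insensitive match and `na=False` to treat missing values as non-matches. The missing entry is added here purely for illustration.

```python
import pandas as pd

names = pd.Series([
    "Braund, Mr. Owen Harris",
    "Allum, Mr. Owen George",
    "Heikkinen, Miss. Laina",
    None,   # a missing name, added for illustration
])

# True where the name contains "owen" in any letter case; missing names count as False
mask = names.str.contains("owen", case=False, na=False)
count = int(mask.sum())

print(count)
```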


#### Checking for Missing Values

In [42]:
# First, let's see which columns have missing data (i.e. NaN) and how many values are missing in each
titanic_df.isnull().sum()

PassengerId      0
Survived         0
Pclass           0
Name             0
Sex              0
Age            177
SibSp            0
Parch            0
Ticket           0
Fare             0
Cabin          687
Embarked         2
Surname          0
dtype: int64

In [43]:
# Let's only work with passenger data for which the age is known
passenger_with_known_age = titanic_df[titanic_df["Age"].notna()]
passenger_with_known_age.head(5)

Unnamed: 0,PassengerId,Survived,Pclass,Name,Sex,Age,SibSp,Parch,Ticket,Fare,Cabin,Embarked,Surname
0,1,0,3,"Braund, Mr. Owen Harris",male,22.0,1,0,A/5 21171,7.25,,S,Braund
1,2,1,1,"Cumings, Mrs. John Bradley (Florence Briggs Th...",female,38.0,1,0,PC 17599,71.2833,C85,C,Cumings
2,3,1,3,"Heikkinen, Miss. Laina",female,26.0,0,0,STON/O2. 3101282,7.925,,S,Heikkinen
3,4,1,1,"Futrelle, Mrs. Jacques Heath (Lily May Peel)",female,35.0,1,0,113803,53.1,C123,S,Futrelle
4,5,0,3,"Allen, Mr. William Henry",male,35.0,0,0,373450,8.05,,S,Allen


In [38]:
# we can check the shape here
passenger_with_known_age.shape

(714, 13)
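
Two other common ways to deal with missing values are dropping the affected rows with `dropna()` or filling them in with `fillna()`. The sketch below uses a tiny made-up frame; filling with the median is only one possible choice, and whether it is appropriate depends on the analysis.

```python
import numpy as np
import pandas as pd

passengers = pd.DataFrame({
    "Name": ["A", "B", "C", "D"],
    "Age": [22.0, np.nan, 26.0, np.nan],
})

# Drop rows where Age is missing (equivalent to filtering with notna())
known_age = passengers.dropna(subset=["Age"])

# Alternatively, fill the missing ages with the median of the known ages
filled = passengers.fillna({"Age": passengers["Age"].median()})

print(known_age.shape)
print(filled)
```

Both calls return new DataFrames, so `passengers` itself still contains the missing values afterwards.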

#### Extracting information about the value of a column

In [45]:
# We can get the unique values contained in a column
# For example, let's see the distinct values of Sex
titanic_df["Sex"].unique()

array(['male', 'female'], dtype=object)

In [46]:
# We can check the total number of each distinct value using value_counts()
titanic_df["Sex"].value_counts()

male      577
female    314
Name: Sex, dtype: int64
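
Two closely related helpers are `nunique()`, which returns the number of distinct values, and `value_counts(normalize=True)`, which returns proportions instead of raw counts. A minimal sketch with made-up values:

```python
import pandas as pd

sex = pd.Series(["male", "female", "male", "male"])

n_distinct = sex.nunique()                  # how many different values there are
shares = sex.value_counts(normalize=True)   # fraction of rows per value

print(n_distinct)
print(shares)
```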

### Mini project 8 -- Revisiting Ali and Johnson
As Ali & Johnson continues to grow, keeping all of their data in a dictionary is becoming inefficient. You are approached as the in-house data guy to find a solution to this problem. Below is the dictionary containing the products sold and various fields that must be transformed to an efficient storage format.

1. Convert this dictionary to a Pandas DataFrame
2. Altogether, how many products are present in your newly created DataFrame?
3. How many unique product categories are present in the DataFrame?
4. Which category has the most products sold?
5. What is the lowest price at which you can expect to get a product from Ali & Johnson? Which product goes for this price?
6. What is the total number of products sold during this period?
7. Create a new field (i.e. column) to calculate the revenue generated from selling each product. [revenue = price * unit sold]
8. Which product category generated the highest revenue?

In [8]:
product_dict = {
    "Product Name": ['footwear', 'eyewear', 'bags', 'table', 'ear ring', 'chair', 'couch', 'bed frame', 'TV', 'Sunglass', 'refrigerator'],
    "Product Category": ['Fashion', 'Fashion', 'Fashion', 'Furniture', 'Fashion', 'Furniture', 'Furniture', 'Furniture', 'Electronics', 'Fashion', 'Electronics'],
    "Unit Sold": [3, 4, 2, 6, 4, 5, 2, 1, 8, 8, 7],
    "Price": [10, 5, 7, 15, 20, 17, 23, 29, 54, 6, 60],
}