# Data Analytics Module I - Chapters 1 and 2

## Load Pandas

Python doesn’t load all of the libraries available to it by default. We have to add an import statement to our code in order to use library functions.


In [1]:
#Import the pandas library as 'pd'

import pandas as pd

When we call a function from a library, we use the syntax **LibraryName.FunctionName**. For Pandas, that would be **pandas.FunctionName**.

By giving *pandas* a *nickname* (alias) such as **pd**, we can write **pd.FunctionName** instead.
This saves us from typing out the full “pandas” name every time we use a function from the Pandas library.


## Read CSV file using Pandas

Pandas can be used to import data stored in the Comma-Separated Values (CSV) file format. CSV is a common and simple way of structuring tabular data, where each line corresponds to a row and the values within a line are separated by commas.
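To make the row-and-comma structure concrete, here is a minimal sketch that reads CSV text from an in-memory string instead of a file. The column names and values are made up for illustration and are not taken from the sample dataset.

```python
import io
import pandas as pd

# Hypothetical CSV content: the first line holds the column names,
# each following line is one row, and commas separate the values.
csv_text = """Id,Gender,Age
1,M,64
2,F,18
3,,40
"""

# read_csv accepts a file path or a file-like object such as StringIO
small_df = pd.read_csv(io.StringIO(csv_text))
print(small_df)
```

The empty field between the two commas in the last row is read as a missing value (NaN).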


In [2]:
# Read csv file using pandas

#Step 1: Specify File Path
file_path = 'data/sample_dataset_pandas.csv'

#Step 2: Read CSV file
pd.read_csv(file_path)

      Id Gender   Age    Income      Height     Weight
0      1      M  64.0   24141.0         NaN  91.026092
1      2      F  18.0   82523.0  189.533369  72.400692
2      3    NaN  40.0   37928.0  153.884926  90.153410
3      4      F  47.0   83622.0  189.395822  63.506353
4      5      M  43.0  120475.0  155.234732  85.101889
..   ...    ...   ...       ...         ...        ...
495  496      M  52.0  143234.0  156.044812  66.805198
496  497      M   NaN   56023.0  159.155331  80.400186
497  498    NaN  44.0   45923.0  180.344630  63.905775
498  499      F  40.0   72164.0  151.121012        NaN


This code returns an overview of what the dataset looks like, showing the first and last five rows.

The **read_csv** function has successfully read our file, but the result was only displayed and not kept. To keep the data in memory for further processing and analysis, we assign it to a new variable called “df”, short for dataframe:

In [3]:
# Save csv file to memory using pandas 
df = pd.read_csv(file_path)
df

      Id Gender   Age    Income      Height     Weight
0      1      M  64.0   24141.0         NaN  91.026092
1      2      F  18.0   82523.0  189.533369  72.400692
2      3    NaN  40.0   37928.0  153.884926  90.153410
3      4      F  47.0   83622.0  189.395822  63.506353
4      5      M  43.0  120475.0  155.234732  85.101889
..   ...    ...   ...       ...         ...        ...
495  496      M  52.0  143234.0  156.044812  66.805198
496  497      M   NaN   56023.0  159.155331  80.400186
497  498    NaN  44.0   45923.0  180.344630  63.905775
498  499      F  40.0   72164.0  151.121012        NaN


If the dataset contains many samples, it is a good idea to use the **head()** method of Pandas to see only the first few rows. By default, head() returns the first 5 rows, but we can specify how many rows to display by passing a number as an argument, for example **head(*10*)**. The cell below asks for the first 100 rows:

In [4]:
#Print the first 100 rows of our dataframe
df.head(100)

      Id Gender   Age    Income      Height     Weight
0      1      M  64.0   24141.0         NaN  91.026092
1      2      F  18.0   82523.0  189.533369  72.400692
2      3    NaN  40.0   37928.0  153.884926  90.153410
3      4      F  47.0   83622.0  189.395822  63.506353
4      5      M  43.0  120475.0  155.234732  85.101889
..   ...    ...   ...       ...         ...        ...
95    96    NaN  55.0   94473.0  176.486080        NaN
96    97    NaN  59.0   68462.0  158.627039  88.250594
97    98      M   NaN  120095.0         NaN  89.292180
98    99    NaN  58.0  136820.0         NaN  76.403494


We can also check which data type each column of **df** holds using the **dtypes** attribute:

In [5]:
#Print the data types in the dataframe
df.dtypes

Id          int64
Gender     object
Age       float64
Income    float64
Height    float64
Weight    float64
dtype: object
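One detail in this output is worth a note: Age is stored as float64 even though ages look like whole numbers. A likely reason is the missing values in that column, since NaN is a floating-point value. The sketch below illustrates the effect with made-up values, not with the sample dataset.

```python
import pandas as pd

# Hypothetical values, for illustration only
complete = pd.Series([64, 18, 40])
with_gap = pd.Series([64, None, 40])

print(complete.dtype)  # an integer dtype
print(with_gap.dtype)  # float64, because None is stored as NaN
```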

## Explore the DataFrame Object

Let’s explore the DataFrame Object further. We will be using both methods and attributes.

**Methods** are functions that we can apply to the DataFrame to perform specific operations. Calling a method requires parentheses. If we wish to see the information of a dataframe, we can use the **info()** method:

In [6]:
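#Print dataframe information
# Note: df.info() prints the summary itself and returns None,
# which is why the last line of the output below is "None"
info = df.info()
print(info)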

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 500 entries, 0 to 499
Data columns (total 6 columns):
 #   Column  Non-Null Count  Dtype  
---  ------  --------------  -----  
 0   Id      500 non-null    int64  
 1   Gender  335 non-null    object 
 2   Age     452 non-null    float64
 3   Income  396 non-null    float64
 4   Height  453 non-null    float64
 5   Weight  458 non-null    float64
dtypes: float64(4), int64(1), object(1)
memory usage: 23.6+ KB
None


This summary provides valuable information about the DataFrame’s structure, data types, and the presence of missing values. It’s a quick overview that helps you understand the content and characteristics of the DataFrame.

We can use the **unique()** function to identify the distinct values within a column or an array.

In [7]:
#Identify distinct values within a column
pd.unique(df['Gender'])

array(['M', 'F', nan], dtype=object)

It returns the unique values in the **Gender** column: M, F, and nan (a missing value).
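As a complement to unique(), the sketch below counts how often each value occurs. It uses a small made-up Series rather than the sample dataset; `nunique()` and `value_counts()` are standard Pandas methods.

```python
import numpy as np
import pandas as pd

# Hypothetical column, for illustration only
gender = pd.Series(['M', 'F', np.nan, 'F', 'M', 'F'], name='Gender')

# Distinct values, including NaN
print(gender.unique())

# Number of distinct non-missing values
n_distinct = gender.nunique()
print(n_distinct)

# Frequency of each value; dropna=False also counts missing entries
counts = gender.value_counts(dropna=False)
print(counts)
```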

**Attributes** are properties of the DataFrame that provide information about its characteristics. They don’t require parentheses. If we wish to see the shape of the dataframe, that is, its number of rows and columns, we can use the **shape** attribute:

In [8]:
# Get the shape of the DataFrame (attribute)
shape = df.shape
print(shape)

(500, 6)
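Because shape is a plain tuple, it can be unpacked into separate variables. A minimal sketch with a made-up DataFrame:

```python
import pandas as pd

# Hypothetical frame, for illustration only
demo = pd.DataFrame({'Id': [1, 2, 3], 'Age': [64.0, 18.0, 40.0]})

# shape is a tuple of (rows, columns), so it can be unpacked
n_rows, n_cols = demo.shape

# len() gives the number of rows; size gives rows times columns
print(n_rows, n_cols, len(demo), demo.size)
```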


### Exercise 1
What would be the output of the following commands?
-  df.tail()
-  df.columns


In [9]:
df.tail(10)

      Id Gender   Age    Income      Height     Weight
490  491      F  60.0  117648.0  181.056243        NaN
491  492    NaN  49.0   50634.0  155.796445  95.706997
492  493      M  58.0   21617.0  178.247016  96.618002
493  494      F  34.0   28917.0  150.552383  95.738284
494  495    NaN  55.0   60864.0  182.079665  71.311427
495  496      M  52.0  143234.0  156.044812  66.805198
496  497      M   NaN   56023.0  159.155331  80.400186
497  498    NaN  44.0   45923.0  180.344630  63.905775
498  499      F  40.0   72164.0  151.121012        NaN
499  500    NaN  42.0       NaN  170.127901  51.950306


In [10]:
df.columns

Index(['Id', 'Gender', 'Age', 'Income', 'Height', 'Weight'], dtype='object')

df.tail() is a method that returns the last 5 rows of our dataframe by default; the cell above passes 10 to show the last 10 rows. df.columns is an attribute that provides access to the column labels and their dtype.

## Selecting Data Using Labels

To select a single column, use the DataFrame’s name followed by the column label in square brackets **['ColumnLabel']**.

In [11]:
# Select the "Age" column
df['Age']

0      64.0
1      18.0
2      40.0
3      47.0
4      43.0
       ... 
495    52.0
496     NaN
497    44.0
498    40.0
499    42.0
Name: Age, Length: 500, dtype: float64

We can also use the column name as an *attribute* to access data from that column using **df.Age**


In [12]:
# Select the 'Age' column as an attribute
df.Age

0      64.0
1      18.0
2      40.0
3      47.0
4      43.0
       ... 
495    52.0
496     NaN
497    44.0
498    40.0
499    42.0
Name: Age, Length: 500, dtype: float64
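Attribute access is convenient but has limits: it does not work for column names that contain spaces or that collide with an existing DataFrame method. The sketch below uses a made-up DataFrame whose column names were chosen to show this.

```python
import pandas as pd

# Hypothetical frame; 'count' clashes with the DataFrame.count method
demo = pd.DataFrame({'Age': [64, 18], 'count': [1, 2], 'Home Town': ['A', 'B']})

# Works: 'Age' is a valid identifier and not a method name
print(demo.Age.equals(demo['Age']))

# demo.count is the method, not the column
print(callable(demo.count))

# Square brackets always work, even with spaces in the name
print(demo['Home Town'])
```

For this reason the square-bracket form is usually the safer default.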

To select multiple columns, enclose the column labels in double square brackets **[['Column1', 'Column2']]**.

In [13]:
# Select the 'Age' and 'Income' columns
df[['Age','Income']]

      Age    Income
0    64.0   24141.0
1    18.0   82523.0
2    40.0   37928.0
3    47.0   83622.0
4    43.0  120475.0
..    ...       ...
495  52.0  143234.0
496   NaN   56023.0
497  44.0   45923.0
498  40.0   72164.0


We can also store the result in a new variable and access it later.

In [14]:
# Select the 'Age' and 'Income' columns and store them in a variable
age_income_columns = df[['Age','Income']]
print("Selected Age and Income columns:\n",age_income_columns)

Selected Age and Income columns:
       Age    Income
0    64.0   24141.0
1    18.0   82523.0
2    40.0   37928.0
3    47.0   83622.0
4    43.0  120475.0
..    ...       ...
495  52.0  143234.0
496   NaN   56023.0
497  44.0   45923.0
498  40.0   72164.0
499  42.0       NaN

[500 rows x 2 columns]


### Exercise 2
What happens if you ask for a column that doesn’t exist?
-  df['Name']


In [15]:
df['Name']

KeyError: 'Name'
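A minimal sketch of ways to handle a column that may not exist, using a small made-up DataFrame: test for membership first, use `get()`, which returns a default value instead of raising, or catch the exception.

```python
import pandas as pd

# Hypothetical frame, for illustration only
demo = pd.DataFrame({'Id': [1, 2], 'Age': [64.0, 18.0]})

# Option 1: check whether the column exists before selecting it
has_name = 'Name' in demo.columns
print(has_name)

# Option 2: get() returns None (or a chosen default) for a missing column
name_column = demo.get('Name')
print(name_column)

# Option 3: catch the exception explicitly
try:
    demo['Name']
    caught = False
except KeyError:
    caught = True
print(caught)
```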

## Extracting Range-based Subsets (Slicing)
Slicing is a technique used to extract a portion or subset of elements from a sequence, such as a list, array, or string. It allows us to specify a range of indices to retrieve a subset of the data.

-  Getting Specific Elements
-  Getting a Set of Elements
-  Getting First Few Elements
-  Getting Last Few Elements

Let's go through a simple example of a list before moving back to dataframes:


In [16]:
# Sample list
my_list = [10,20,30,40,50,60,70,80]

# Getting specific elements
element_at_index_2 = my_list[2]
print("Element at index 2: ", element_at_index_2)

# Getting a set of elements
subset = my_list[2:5]  # upper bound (index 5) not included
print("Subset from index 2 to 4: ", subset)

# Getting first few elements
first_three_elements = my_list[:3]  # indices 0, 1, 2; index 3 is not included
print("First three elements: ", first_three_elements)

# Getting last few elements
last_two_elements  = my_list[-2:]
print("Last two elements:", last_two_elements)


Element at index 2:  30
Subset from index 2 to 4:  [30, 40, 50]
First three elements:  [10, 20, 30]
Last two elements: [70, 80]


In [17]:
# Print First, third and fifth number of my_list
print(" 1. First, third, and fifth numbers:", [my_list[i] for i in [0, 2, 4]])
print("2. First, third, and fifth numbers:",  my_list[0:5:2])

 1. First, third, and fifth numbers: [10, 30, 50]
2. First, third, and fifth numbers: [10, 30, 50]


### Exercise 3
What would be the output of the following command?
-  my_list[len(my_list)]


In [18]:
my_list[len(my_list)]

IndexError: list index out of range

In [19]:
len(my_list)

#The indexes of our list go from 0 to len(my_list)-1

8

In [20]:
my_list[len(my_list)-1]

80
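Python also accepts negative indices, which count from the end of the list. The sketch below repeats `my_list` from the earlier cell so that it runs on its own.

```python
# Same sample list as above
my_list = [10, 20, 30, 40, 50, 60, 70, 80]

# -1 refers to the last element, the same as len(my_list) - 1
last_element = my_list[-1]
print(last_element)

# Unlike single-index access, a slice that runs past the end
# does not raise an error; it simply stops at the last element
tail_slice = my_list[6:100]
print(tail_slice)
```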

## Slicing Rows and Columns
Slicing rows and columns simultaneously involves using **.loc** or **.iloc** and specifying the row indices and column labels or indices we want to include.

-  **.loc** is label-based indexing, meaning we specify the row and column labels.
-  **.iloc** is integer-based indexing, meaning we use integer indices for rows and columns.

### Using .loc

In [21]:
# Slice rows 1 to 3 and columns 'Gender' and 'Age' using .loc (the end label 3 is included)
sliced_rows_columns_loc = df.loc[1:3, ['Gender','Age']]
print("Sliced Rows and Columns using .loc:\n", sliced_rows_columns_loc)

Sliced Rows and Columns using .loc:
   Gender   Age
1      F  18.0
2    NaN  40.0
3      F  47.0


Now, if we want to select the **‘Gender’, ‘Age’, and ‘Weight’** columns for the row labels **1, 3, and 4**, we can pass lists of labels instead of a slice:

In [22]:
# Slice rows 1, 3, 4 and columns 'Gender', 'Age', and 'Weight'
sliced_rows_columns_loc2 = df.loc[[1,3,4],['Gender','Age','Weight']]
print("Sliced Rows and Columns using .loc:\n", sliced_rows_columns_loc2)

Sliced Rows and Columns using .loc:
   Gender   Age     Weight
1      F  18.0  72.400692
3      F  47.0  63.506353
4      M  43.0  85.101889


### Using .iloc

In [23]:
# Slice rows 1 to 3 and columns at index 1 to 3 using .iloc
sliced_rows_columns_iloc = df.iloc[1:4,1:4] # 4 is not included
print("Sliced Rows and Columns using .iloc:\n", sliced_rows_columns_iloc)

Sliced Rows and Columns using .iloc:
   Gender   Age   Income
1      F  18.0  82523.0
2    NaN  40.0  37928.0
3      F  47.0  83622.0


In both cases, using *.loc* or *.iloc*, the first argument specifies the rows to include, and the second argument specifies the columns to include. 
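The difference between the two becomes visible once the row labels no longer match the row positions, for example after filtering. A minimal sketch with a made-up DataFrame:

```python
import pandas as pd

# Hypothetical frame, for illustration only
demo = pd.DataFrame({'Age': [64, 18, 40, 47], 'Gender': ['M', 'F', 'F', 'M']})

# Keep rows with Age >= 40; the remaining row labels are 0, 2, 3
older = demo[demo['Age'] >= 40]

# .loc uses the label: the row whose index label is 2
by_label = older.loc[2, 'Age']

# .iloc uses the position: the third row (position 2), whose label is 3
by_position = older.iloc[2, 0]

print(by_label, by_position)

# Slices also differ: .loc includes the end label, .iloc excludes the end position
print(len(demo.loc[1:3]), len(demo.iloc[1:3]))
```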

## Subsetting Data using Criteria
Subsetting data using criteria involves selecting a subset of rows from a DataFrame based on specific conditions. This is often done to keep only the rows that meet certain criteria, or to focus on the data points that are relevant to our analysis.

We can use conditional statements to filter rows. The condition is typically applied to a column, producing one True or False value per row, and only the rows where it is True are retained.

For example, let’s say we want to subset the DataFrame to only include individuals with an age greater than 25:


In [24]:
# Subset data for individuals with age > 25
subset_age_greater_25 = df[df['Age'] > 25]
print("Subset of individuals with age > 25:\n", subset_age_greater_25)

Subset of individuals with age > 25:
       Id Gender   Age    Income      Height     Weight
0      1      M  64.0   24141.0         NaN  91.026092
2      3    NaN  40.0   37928.0  153.884926  90.153410
3      4      F  47.0   83622.0  189.395822  63.506353
4      5      M  43.0  120475.0  155.234732  85.101889
5      6      M  62.0  120535.0  179.602918  76.109803
..   ...    ...   ...       ...         ...        ...
494  495    NaN  55.0   60864.0  182.079665  71.311427
495  496      M  52.0  143234.0  156.044812  66.805198
497  498    NaN  44.0   45923.0  180.344630  63.905775
498  499      F  40.0   72164.0  151.121012        NaN
499  500    NaN  42.0       NaN  170.127901  51.950306

[374 rows x 6 columns]


Also, we can combine multiple criteria using logical operators such as **&** *(AND)* and **|** *(OR)* to create more complex conditions.

For instance, to subset the DataFrame for individuals with an age greater than 25 and an income greater than 60000:

In [25]:
# Subset data for individuals with age > 25 and income > 60000
subset_age_income = df[(df['Age'] > 25) & (df['Income'] > 60000)]
print("Subset of individuals with age > 25 and income > 60000:\n", subset_age_income)

Subset of individuals with age > 25 and income > 60000:
       Id Gender   Age    Income      Height     Weight
3      4      F  47.0   83622.0  189.395822  63.506353
4      5      M  43.0  120475.0  155.234732  85.101889
5      6      M  62.0  120535.0  179.602918  76.109803
6      7      M  53.0  145042.0  176.081178        NaN
7      8    NaN  61.0   97758.0  154.752066  81.734248
..   ...    ...   ...       ...         ...        ...
487  488      F  44.0  118330.0  179.536224  88.574138
490  491      F  60.0  117648.0  181.056243        NaN
494  495    NaN  55.0   60864.0  182.079665  71.311427
495  496      M  52.0  143234.0  156.044812  66.805198
498  499      F  40.0   72164.0  151.121012        NaN

[209 rows x 6 columns]
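The following sketch contrasts **&** and **|** on a small made-up DataFrame. The parentheses around each comparison are required, because & and | take precedence over > in Python.

```python
import pandas as pd

# Hypothetical frame, for illustration only
demo = pd.DataFrame({
    'Age': [64, 18, 40, 22],
    'Income': [24141, 82523, 37928, 30000],
})

# AND: both conditions must hold
both = demo[(demo['Age'] > 25) & (demo['Income'] > 60000)]

# OR: at least one condition must hold
either = demo[(demo['Age'] > 25) | (demo['Income'] > 60000)]

print(len(both), len(either))
```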


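We can also use the **~** symbol to negate a condition. For example, negating **Age > 25** returns every row that the condition did not select. Note that this is not quite the same as an age less than or equal to 25: a comparison with a missing value (NaN) is False, so after negation the rows with a missing age are included as well, as the output below shows: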

In [26]:
# Subset data for individuals with age <= 25
# (negating the condition also keeps rows where Age is missing)
subset_age_le_25 = df[~(df['Age'] > 25)]
print("Subset of individuals with age <= 25:\n", subset_age_le_25)

Subset of individuals with age <= 25:
       Id Gender   Age    Income      Height     Weight
1      2      F  18.0   82523.0  189.533369  72.400692
11    12      M  21.0  134072.0  188.650997  52.732921
12    13    NaN  19.0       NaN  173.446142  61.146883
15    16    NaN  23.0   52297.0  171.859691  85.833607
16    17    NaN  23.0  126215.0  185.392064  81.741455
..   ...    ...   ...       ...         ...        ...
464  465    NaN   NaN       NaN  175.017878        NaN
474  475      F   NaN       NaN  187.281876  92.718267
479  480    NaN  22.0  140959.0  189.221855  94.844926
482  483    NaN  21.0   25786.0  155.483269  55.226594
496  497      M   NaN   56023.0  159.155331  80.400186

[126 rows x 6 columns]
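The output above includes rows whose Age is NaN. The sketch below, using made-up ages, shows why: a comparison with NaN is False, so its negation is True, whereas a direct **<=** comparison leaves those rows out.

```python
import numpy as np
import pandas as pd

# Hypothetical ages, for illustration only
ages = pd.DataFrame({'Age': [64.0, 18.0, np.nan, 25.0]})

# Negating "Age > 25" keeps 18, 25, and also the missing age
negated = ages[~(ages['Age'] > 25)]

# The direct comparison keeps only 18 and 25
direct = ages[ages['Age'] <= 25]

print(len(negated), len(direct))
```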


The **isin()** function is used to filter data based on whether values are present in a specified list or iterable. It’s a convenient way to subset data when we want to select rows that match specific values for a particular column.

Let’s say we want to select rows where the **‘Gender’** column has values **‘M’** or **‘F’**:


In [27]:
# Subsetting data using isin() function
subset_gender = df[df['Gender'].isin(['M','F'])]
print("Subset of data with 'Gender' values M or F :\n", subset_gender)

Subset of data with 'Gender' values M or F :
       Id Gender   Age    Income      Height     Weight
0      1      M  64.0   24141.0         NaN  91.026092
1      2      F  18.0   82523.0  189.533369  72.400692
3      4      F  47.0   83622.0  189.395822  63.506353
4      5      M  43.0  120475.0  155.234732  85.101889
5      6      M  62.0  120535.0  179.602918  76.109803
..   ...    ...   ...       ...         ...        ...
492  493      M  58.0   21617.0  178.247016  96.618002
493  494      F  34.0   28917.0  150.552383  95.738284
495  496      M  52.0  143234.0  156.044812  66.805198
496  497      M   NaN   56023.0  159.155331  80.400186
498  499      F  40.0   72164.0  151.121012        NaN

[335 rows x 6 columns]


In [28]:
# Subsetting data using isin() function for 'Id' column
subset_Id = df[df['Id'].isin([1,2,3])]
print("Subset of data with 'Id' values 1,2,3:\n", subset_Id)

Subset of data with 'Id' values 1,2,3:
    Id Gender   Age   Income      Height     Weight
0   1      M  64.0  24141.0         NaN  91.026092
1   2      F  18.0  82523.0  189.533369  72.400692
2   3    NaN  40.0  37928.0  153.884926  90.153410


In [29]:
# Subsetting data by negating isin() function using ~
subset_gender_other = df[~df['Gender'].isin(['M','F'])]
print("Subset of data with 'Gender' values other than M or F:\n", subset_gender_other)

Subset of data with 'Gender' values other than M or F:
       Id Gender   Age    Income      Height     Weight
2      3    NaN  40.0   37928.0  153.884926  90.153410
7      8    NaN  61.0   97758.0  154.752066  81.734248
8      9    NaN  44.0  137929.0  150.101907  98.184214
12    13    NaN  19.0       NaN  173.446142  61.146883
15    16    NaN  23.0   52297.0  171.859691  85.833607
..   ...    ...   ...       ...         ...        ...
488  489    NaN  30.0   23785.0  165.755358        NaN
491  492    NaN  49.0   50634.0  155.796445  95.706997
494  495    NaN  55.0   60864.0  182.079665  71.311427
497  498    NaN  44.0   45923.0  180.344630  63.905775
499  500    NaN  42.0       NaN  170.127901  51.950306

[165 rows x 6 columns]


The **isnull()** and **notnull()** functions are used to detect missing (NaN) values. **isnull()** returns an object of the same shape as its input (a DataFrame for a DataFrame, a Series for a single column), with True marking the missing values. **notnull()** returns the opposite.

Let’s say we want to select rows where the ‘Age’ column *has missing values*:


In [30]:
# Subsetting data using isnull() function
subset_missing_age = df[df['Age'].isnull()]
print("Subset of data with missing 'Age' values:\n", subset_missing_age)

Subset of data with missing 'Age' values:
       Id Gender  Age    Income      Height     Weight
34    35      M  NaN       NaN         NaN  77.748112
48    49      M  NaN   80545.0  160.882447  84.132224
49    50      M  NaN   84812.0  181.811067  50.797431
62    63      F  NaN       NaN  160.336857  68.404654
76    77      M  NaN   96922.0  150.758791  51.762343
81    82    NaN  NaN   85039.0  176.034126  82.357999
89    90    NaN  NaN  104684.0  151.599414  75.361666
91    92      F  NaN  144675.0  180.480869  56.129768
97    98      M  NaN  120095.0         NaN  89.292180
100  101    NaN  NaN   41682.0  154.944984  74.792711
121  122      M  NaN   35769.0  184.735616  68.987392
133  134    NaN  NaN       NaN  155.219858  79.753255
138  139    NaN  NaN       NaN  156.132598  96.430512
162  163    NaN  NaN   49601.0  169.890227  79.864129
193  194      M  NaN   76822.0  166.203285  77.342092
202  203      M  NaN  119766.0  162.412929  57.717951
203  204      M  NaN   80392.0  170.627

Let’s say we want to select rows where the ‘Age’ column *does not have missing values*:

In [31]:
# Subsetting data using notnull() function
subset_notMissing_age = df[df['Age'].notnull()]
print("Subset of data without missing 'Age' values:\n", subset_notMissing_age)

Subset of data without missing 'Age' values:
       Id Gender   Age    Income      Height     Weight
0      1      M  64.0   24141.0         NaN  91.026092
1      2      F  18.0   82523.0  189.533369  72.400692
2      3    NaN  40.0   37928.0  153.884926  90.153410
3      4      F  47.0   83622.0  189.395822  63.506353
4      5      M  43.0  120475.0  155.234732  85.101889
..   ...    ...   ...       ...         ...        ...
494  495    NaN  55.0   60864.0  182.079665  71.311427
495  496      M  52.0  143234.0  156.044812  66.805198
497  498    NaN  44.0   45923.0  180.344630  63.905775
498  499      F  40.0   72164.0  151.121012        NaN
499  500    NaN  42.0       NaN  170.127901  51.950306

[452 rows x 6 columns]


In [32]:
# Subsetting data using notnull() for 'Age' and 'Income'

# 1. Using the & operator
subset_notnull_age_income = df[(df['Age'].notnull()) & (df['Income'].notnull())]
print("Subset not null using & operator:\n", subset_notnull_age_income)

# 2. Selecting both columns by label and requiring every value in the row to be present
subset_notnull_age_income_2 = df[df[['Age','Income']].notnull().all(axis=1)]
print("Subset not null using column labels:\n", subset_notnull_age_income_2)

Subset not null using & operator:
       Id Gender   Age    Income      Height     Weight
0      1      M  64.0   24141.0         NaN  91.026092
1      2      F  18.0   82523.0  189.533369  72.400692
2      3    NaN  40.0   37928.0  153.884926  90.153410
3      4      F  47.0   83622.0  189.395822  63.506353
4      5      M  43.0  120475.0  155.234732  85.101889
..   ...    ...   ...       ...         ...        ...
493  494      F  34.0   28917.0  150.552383  95.738284
494  495    NaN  55.0   60864.0  182.079665  71.311427
495  496      M  52.0  143234.0  156.044812  66.805198
497  498    NaN  44.0   45923.0  180.344630  63.905775
498  499      F  40.0   72164.0  151.121012        NaN

[359 rows x 6 columns]
Subset not null using column labels:
       Id Gender   Age    Income      Height     Weight
0      1      M  64.0   24141.0         NaN  91.026092
1      2      F  18.0   82523.0  189.533369  72.400692
2      3    NaN  40.0   37928.0  153.884926  90.153410
3      4      F  47.0  

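The **all(axis=1)** method checks, for each row, whether every value in that row is True. Here it keeps only the rows in which both ‘Age’ and ‘Income’ are present.

-  axis=1 combines the values across the columns, giving one result per row.
-  axis=0 (the default) combines the values down the rows, giving one result per column.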
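To see the effect of the axis argument on a small scale, the sketch below compares all(axis=1) with any(axis=1) on a made-up DataFrame. It also shows `dropna(subset=...)`, a Pandas shortcut that should give the same rows as the all(axis=1) filter.

```python
import numpy as np
import pandas as pd

# Hypothetical frame, for illustration only
demo = pd.DataFrame({
    'Age': [64.0, np.nan, 40.0, np.nan],
    'Income': [24141.0, 82523.0, np.nan, np.nan],
})

# One True/False value per cell
not_missing = demo.notnull()

# all(axis=1): True only for rows where every column is present
complete_rows = demo[not_missing.all(axis=1)]

# any(axis=1): True for rows where at least one column is present
partial_rows = demo[not_missing.any(axis=1)]

# dropna(subset=...) is a shortcut for the first case
dropped = demo.dropna(subset=['Age', 'Income'])

print(len(complete_rows), len(partial_rows))
```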

## Calculating Statistics from Pandas DataFrame
We can use Pandas DataFrame’s built-in methods to quickly generate summary statistics for our data. For example, we can use the **describe()** method to get summary statistics for numerical columns, such as count, mean, standard deviation, minimum, quartiles, and maximum.

In [33]:
# Print the summary statistics of the dataframe using describe()
summary = df.describe()
print(summary)

               Id         Age         Income      Height      Weight
count  500.000000  452.000000     396.000000  453.000000  458.000000
mean   250.500000   41.856195   86108.492424  169.529216   75.025116
std    144.481833   13.940613   36654.922564   10.878351   14.774209
min      1.000000   18.000000   20045.000000  150.015783   50.295231
25%    125.750000   29.000000   57367.000000  160.017565   61.837609
50%    250.500000   42.000000   88871.000000  169.635492   75.331686
75%    375.250000   53.250000  116668.000000  178.288046   88.464562
max    500.000000   65.000000  149846.000000  189.968273   99.954461


If we want to calculate the standard deviation of a numerical column we can use **std()** function.

In [34]:
#Print the standard deviation of the Age column using std()
age_std = df['Age'].std()
print("Age Standard Deviation:", age_std)

Age Standard Deviation: 13.940613087236967
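By default, std() in Pandas skips missing values and computes the sample standard deviation (dividing by n - 1). The sketch below checks this by hand on made-up ages and shows the `ddof=0` option for the population version.

```python
import numpy as np
import pandas as pd

# Hypothetical ages, for illustration only
ages = pd.Series([64.0, 18.0, 40.0, 47.0, np.nan])

# pandas skips NaN and divides by n - 1 by default (ddof=1)
sample_std = ages.std()

# Passing ddof=0 divides by n instead
population_std = ages.std(ddof=0)

# The same sample value computed by hand from the non-missing entries
values = ages.dropna()
manual = (((values - values.mean()) ** 2).sum() / (len(values) - 1)) ** 0.5

print(sample_std, population_std, manual)
```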


There are many more statistical functions available in Pandas. I encourage you to check out the following resources:
-  https://www.tutorialspoint.com/python_pandas/python_pandas_descriptive_statistics.htm
-  https://www.scaler.com/topics/pandas/statistical-functions-in-pandas/

I promise you will have fun!

## Groups in Pandas

Frequently, we need to compute summary statistics for subsets of our dataset or for specific attributes. Before grouping, let’s recall how to summarize a single column. For instance, to get the summary statistics of the income of all individuals, we can use the following code:

In [35]:
#Return summary statistics of the Income column 
df['Income'].describe()

count       396.000000
mean      86108.492424
std       36654.922564
min       20045.000000
25%       57367.000000
50%       88871.000000
75%      116668.000000
max      149846.000000
Name: Income, dtype: float64

Again, we might also want to get only specific information, like the maximum:

In [36]:
#Return the maximum value of the Income column
df['Income'].max()

149846.0

or we can get the average income of all individuals:

In [37]:
#Return the average income of all individuals
df['Income'].mean()

86108.49242424243

However, when the intention is to summarize data based on one or more variables, such as age or gender, the Pandas library offers the **groupby()** method. Once a DataFrame is grouped this way, we can compute summary statistics for each group. The example below groups by ‘Age’:

In [38]:
# Group data by age
grouped_data = df.groupby('Age')

# Provide the mean for each numeric column by age
grouped_data.mean(numeric_only = True)

              Id         Income      Height     Weight
Age
18.0  195.090909   96315.375000  170.103731  73.987753
19.0  246.250000   87691.666667  171.623130  70.695070
20.0  228.083333   87823.636364  168.674221  73.915529
21.0  226.000000   89577.777778  171.518129  76.492599
22.0  253.875000  114389.571429  175.781805  75.984598
23.0  183.700000   73661.000000  172.478158  69.541791
24.0  250.428571  110314.000000  165.903550  67.539430
25.0  244.500000   82448.400000  171.177153  76.334082
26.0  159.600000   92140.888889  171.139189  78.765376
27.0  279.125000  114739.166667  161.964803  74.711894
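Grouping by a categorical column such as gender works the same way. The sketch below uses a made-up DataFrame, not the sample dataset; the `dropna=False` argument assumes Pandas 1.1 or later.

```python
import numpy as np
import pandas as pd

# Hypothetical frame, for illustration only
demo = pd.DataFrame({
    'Gender': ['M', 'F', np.nan, 'F', 'M'],
    'Income': [24000.0, 82000.0, 38000.0, 84000.0, 120000.0],
})

# Mean income per gender; rows with a missing Gender are dropped by default
mean_by_gender = demo.groupby('Gender')['Income'].mean()
print(mean_by_gender)

# dropna=False keeps the missing values as their own group
with_missing = demo.groupby('Gender', dropna=False)['Income'].mean()
print(with_missing)

# agg() computes several statistics at once
stats = demo.groupby('Gender')['Income'].agg(['count', 'mean', 'max'])
print(stats)
```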


## Basic Math with Pandas
If desired, it’s entirely possible to perform mathematical operations, such as addition or division, on an entire column of our dataframe. 

 
Let's multiply the weight column by 2:

In [39]:
# Multiply all weight values by 2
df['Weight']*2


0      182.052185
1      144.801384
2      180.306819
3      127.012705
4      170.203778
          ...    
495    133.610396
496    160.800372
497    127.811550
498           NaN
499    103.900612
Name: Weight, Length: 500, dtype: float64

In [40]:
#Creating a new column 'Weight_2' that is going to be equal to df['Weight']*2
df['Weight_2'] = df['Weight']*2
df

      Id Gender   Age    Income      Height     Weight    Weight_2
0      1      M  64.0   24141.0         NaN  91.026092  182.052185
1      2      F  18.0   82523.0  189.533369  72.400692  144.801384
2      3    NaN  40.0   37928.0  153.884926  90.153410  180.306819
3      4      F  47.0   83622.0  189.395822  63.506353  127.012705
4      5      M  43.0  120475.0  155.234732  85.101889  170.203778
..   ...    ...   ...       ...         ...        ...         ...
495  496      M  52.0  143234.0  156.044812  66.805198  133.610396
496  497      M   NaN   56023.0  159.155331  80.400186  160.800372
497  498    NaN  44.0   45923.0  180.344630  63.905775  127.811550
498  499      F  40.0   72164.0  151.121012        NaN         NaN
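Arithmetic also works between two columns, row by row. As a sketch, the example below derives a body mass index from made-up height and weight values. The units (centimetres and kilograms) are an assumption for this illustration; the sample dataset does not state its units.

```python
import pandas as pd

# Hypothetical values; the units (cm and kg) are an assumption
demo = pd.DataFrame({'Height': [180.0, 160.0], 'Weight': [81.0, 64.0]})

# Arithmetic between columns is applied row by row
height_m = demo['Height'] / 100
demo['BMI'] = demo['Weight'] / height_m ** 2

print(demo)
```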


## Concatenating DataFrames
Concatenating DataFrames refers to combining two or more DataFrames along a particular axis (either rows or columns) to create a single larger DataFrame. This is useful when we have data split across multiple DataFrames and we want to consolidate them into one for analysis or processing.

In Pandas, we can use the **concat()** function to concatenate DataFrames. This function provides various options to control how the concatenation should be performed. 

Let’s say we have two DataFrames, **df1** and **df2**, and we want to concatenate them vertically (along rows):

In [41]:
#Concatenate df1 and df2 vertically (along rows)

# Sample DataFrames
data1 = {'A': [1,2,3], 'B':[4,5,6]}
data2 = {'A': [7,8,9], 'B': [10,11,12]}
df1 = pd.DataFrame(data1)
df2 = pd.DataFrame(data2)

# Concatenate DataFrames vertically
concatenated_df = pd.concat([df1,df2],ignore_index = True)
print("Concatenated Dataframe:\n", concatenated_df)

Concatenated Dataframe:
    A   B
0  1   4
1  2   5
2  3   6
3  7  10
4  8  11
5  9  12


In [None]:
print(df1)
print(df2)

In this example, **pd.concat()** is used to concatenate df1 and df2 vertically into concatenated_df. The **ignore_index=True** argument ensures that the index is reset after concatenation.
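To see what ignore_index changes, the sketch below concatenates the same two sample DataFrames with and without it.

```python
import pandas as pd

df1 = pd.DataFrame({'A': [1, 2, 3], 'B': [4, 5, 6]})
df2 = pd.DataFrame({'A': [7, 8, 9], 'B': [10, 11, 12]})

# Without ignore_index, each frame keeps its own labels: 0, 1, 2, 0, 1, 2
kept = pd.concat([df1, df2])
print(list(kept.index))

# With ignore_index=True the result is relabelled 0 to 5
reset = pd.concat([df1, df2], ignore_index=True)
print(list(reset.index))

# Duplicate labels mean .loc[0] returns two rows instead of one
print(len(kept.loc[0]))
```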

We can also concatenate DataFrames **horizontally** by specifying **axis=1** as an argument to **pd.concat()**. This will merge the DataFrames along columns.

In [42]:
#Concatenate df1 and df2 horizontally

# Sample DataFrames
data1 = {'A': [1,2,3], 'B':[4,5,6]}
data2 = {'C': [7,8,9], 'D': [10,11,12]}

df1 = pd.DataFrame(data1)
df2 = pd.DataFrame(data2)

# Concatenate DataFrames horizontally
concatenated_df_horizontal = pd.concat([df1,df2], axis = 1)
print("Horizontal Concatenated DataFrame:\n ", concatenated_df_horizontal)

Horizontal Concatenated DataFrame:
     A  B  C   D
0  1  4  7  10
1  2  5  8  11
2  3  6  9  12


### Exercise 4
Consider two DataFrames, df1 and df2, with the following data

**import pandas as pd**

**data1 = {'A': [1, 2, 3], 'B': [4, 5, 6]}**

**data2 = {'A': [7, 8, 9], 'B': [10, 11, 12]}**

**df1 = pd.DataFrame(data1)**

**df2 = pd.DataFrame(data2)**

What will be the output of the following code:

**result = pd.concat([df1, df2], axis=1)**

**print(result)**

Select the correct answer ***(without running the code)***:

a) The concatenated DataFrame with columns A, B, A, B

b) An error will occur because columns A and B are duplicated

c) The concatenated DataFrame with columns A, B, C, D



In [43]:
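import pandas as pd

data1 = {'A': [1, 2, 3], 'B': [4, 5, 6]}

data2 = {'A': [7, 8, 9], 'B': [10, 11, 12]}

df1 = pd.DataFrame(data1)
df2 = pd.DataFrame(data2)

result = pd.concat([df1, df2], axis=1)

print(result)

   A  B  A   B
0  1  4  7  10
1  2  5  8  11
2  3  6  9  12

The correct answer is a): Pandas allows duplicate column labels, so the result has columns A, B, A, B. To obtain columns A, B, C, D instead, rename the columns of df2 before concatenating, for example with df2.columns = ['C','D'].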


## Saving Pandas DataFrame

We can save a Pandas DataFrame to various file formats using different methods provided by Pandas. Before we move forward with saving a Pandas DataFrame, let’s first create a new directory called “Results” within the directory that contains our code.

Here are some commonly used methods to save a DataFrame:
-  CSV Format: To save a DataFrame to a CSV file, we can use the to_csv() method:


In [None]:
# Save DataFrame to CSV file
output_path = 'Results/output.csv'
df.to_csv(output_path, index = False)

This will save the DataFrame to a CSV file named ‘output.csv’ inside a directory called “Results”, without including the index.

-  Excel Format: To save a DataFrame to an Excel file, we can use the to_excel() method:

In [None]:
# Save DataFrame to Excel file
output_path = 'Results/output.xlsx'
df.to_excel(output_path, index = False)

This will save the DataFrame to an Excel file named ‘output.xlsx’ inside a directory called “Results”, without including the index.


-  Other Formats: Pandas supports various other formats, including JSON, Parquet, HDF5, and more. We can use the appropriate method based on the desired format:

    -   JSON: df.to_json("output.json", orient="records")
    -   Parquet: df.to_parquet("output.parquet")
    -   HDF5: df.to_hdf("output.h5", key="data")

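Make sure to replace ‘output’ with your desired file name, keeping the extension that matches the format. Note that some formats rely on optional packages: to_excel() needs an Excel writer such as openpyxl, to_parquet() needs pyarrow or fastparquet, and to_hdf() needs PyTables.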
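The directory can also be created from Python before saving. The sketch below is one possible approach, not the only one: it uses a temporary folder as a stand-in for the project directory, a made-up DataFrame, and reads the file back to confirm the round trip.

```python
import tempfile
from pathlib import Path

import pandas as pd

# Hypothetical frame, for illustration only
demo = pd.DataFrame({'Id': [1, 2], 'Age': [64.0, 18.0]})

# A temporary folder stands in for the project directory in this sketch
with tempfile.TemporaryDirectory() as project_dir:
    results_dir = Path(project_dir) / 'Results'

    # to_csv does not create missing folders, so create the folder first
    results_dir.mkdir(parents=True, exist_ok=True)

    output_path = results_dir / 'output.csv'
    demo.to_csv(output_path, index=False)

    # Read the file back to confirm the round trip
    reloaded = pd.read_csv(output_path)

print(reloaded.equals(demo))
```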
