### Today, we will:

1. Understand statistics in data science
2. Calculate and explain the mean, mode, median, variance, standard deviation, and other statistical measures
3. Apply these concepts to real-world data

Statistics in data science involves the collection, analysis, interpretation, and presentation of data. It provides the foundational tools and techniques to extract meaningful insights and make informed decisions from data. In data science, statistics is used to:

- Describe Data: Summarize the characteristics and properties of datasets.
- Make Predictions: Utilize data to forecast trends and outcomes.
- Validate Models: Assess the performance and accuracy of machine learning models.
- Inform Decision Making: Support business strategies and actions based on data-driven insights.

## Overview of the statistics Module in Python

In Python, the `statistics` module is part of the standard library and provides functions to perform basic statistical operations on numerical data. Here is an overview of what the `statistics` module includes and its purpose:

### Purpose of the statistics Module

The `statistics` module is designed to facilitate statistical computations and analysis in Python programs. It offers a set of functions that are essential for analyzing data distributions, making data-driven decisions, and performing statistical calculations in various domains, including data science, research, finance, and more.

### Functions Provided by the statistics Module

The `statistics` module includes functions for calculating common statistical measures:

- **mean**: Calculates the arithmetic mean (average) of data.
- **harmonic_mean**: Calculates the harmonic mean, which is suited to averaging rates or ratios (for example, speeds such as 60 m/h).
- **median**: Computes the median (middle value) of data.
- **mode**: Finds the mode (most common value) in data.
- **variance**: Computes the variance of a sample. A sample is a subset of the population used to gather data and make inferences about that population; the population is the whole group you are interested in studying or learning about.
- **stdev**: Computes the standard deviation of a sample.
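
The functions listed above can be tried on a small list of numbers. This is a minimal sketch; the values in `data` and `speeds` are made up for illustration and do not come from the Titanic dataset.

```python
import statistics

# Hypothetical sample values, chosen only for illustration
data = [2, 4, 4, 5, 7, 8]

mean_value = statistics.mean(data)          # 5
median_value = statistics.median(data)      # 4.5 (average of the two middle values)
mode_value = statistics.mode(data)          # 4 (appears twice)
variance_value = statistics.variance(data)  # 4.8 (sample variance, divides by n - 1)
stdev_value = statistics.stdev(data)        # square root of the sample variance

# Harmonic mean: average speed over two equal distances driven at 40 and 60
speeds = [40, 60]
harmonic_value = statistics.harmonic_mean(speeds)  # 48.0

print(mean_value, median_value, mode_value)
print(variance_value, stdev_value)
print(harmonic_value)
```

Note that the harmonic mean (48.0) is lower than the arithmetic mean of the two speeds (50), because more time is spent travelling at the slower speed.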

Alternatively, the third-party **numpy** library can perform the same statistical tasks. For more information, visit https://numpy.org/doc/stable/reference/routines.statistics.html
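
One difference is worth keeping in mind when switching between the two libraries: `statistics.variance` computes the sample variance, while `numpy.var` computes the population variance unless `ddof=1` is passed. The sketch below illustrates this with made-up values and assumes numpy is installed.

```python
import statistics
import numpy as np

# Hypothetical values, chosen only for illustration
values = [2, 4, 4, 5, 7, 8]

sample_var = statistics.variance(values)       # divides by n - 1
population_var = statistics.pvariance(values)  # divides by n

np_default = np.var(values)        # population variance (ddof=0 is the default)
np_sample = np.var(values, ddof=1) # sample variance, matches statistics.variance

print(sample_var, np_sample)
print(population_var, np_default)
```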

Now we will perform various statistical tasks on the Titanic dataset using Python libraries. The data is stored as text in comma-separated values (.csv) format.

## First, download the dataset from Google Drive

In [None]:
try:
    import gdown
except ImportError:
    # Install gdown on the fly if it is missing (notebook shell command)
    !pip install gdown
    import gdown

# File ID from your Google Drive link
file_id = '1b7_g2DfgBcJ3hy0sETAgLLF1wBbf_1bS'
# URL to download the file
url = f'https://drive.google.com/uc?id={file_id}'

# Download the file
dataset = 'titanic.csv'  # Specify the output file name
gdown.download(url, dataset, quiet=False)

Downloading...
From: https://drive.google.com/uc?id=1b7_g2DfgBcJ3hy0sETAgLLF1wBbf_1bS
To: /content/titanic.csv
100%|██████████| 106k/106k [00:00<00:00, 45.9MB/s]


'titanic.csv'

## Next, load the dataset

In [None]:
# Import the necessary libraries
import pandas as pd  # pandas is used for data manipulation and analysis; run "pip install pandas" if it is not installed
file_path = 'titanic.csv'  # the data is in comma-separated values (CSV) text format
data = pd.read_csv(file_path)

## Now that your data is ready, let's perform some statistical analyses

### 1. What are the columns in the dataset?

In [None]:
import pandas as pd

# Load the dataset
data_frame = pd.read_csv(dataset)

# Get the column names
column_names = data_frame.columns

# Print the column names
print(column_names)

Index(['ticket_class', 'survived', 'name', 'gender', 'age', 'num_siblings',
       'num_parents', 'ticket', 'fare', 'cabin', 'embarked', 'boat', 'body',
       'home.dest'],
      dtype='object')


### 2. Clean the dataset by removing rows with missing values

In [None]:
import pandas as pd

# Load the dataset
data_frame = pd.read_csv('titanic.csv')

# Remove rows where any cell in the specified columns is empty
data_frame_cleaned = data_frame.dropna(subset=['survived', 'gender', 'age', 'fare', 'num_siblings', 'num_parents'])

# Save the cleaned data to a new CSV file
data_frame_cleaned.to_csv('titanic_cleaned.csv', index=False)
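
Before dropping rows, it can help to see how many values are missing in each column. The sketch below shows one way to do that on a tiny made-up table; `toy` and its values are hypothetical and are not taken from the Titanic file.

```python
import pandas as pd

# Hypothetical toy data (NOT the Titanic file); None marks a missing cell
toy = pd.DataFrame({
    "age": [22.0, None, 35.0, 41.0],
    "fare": [7.25, 71.28, None, 8.05],
    "cabin": [None, "C85", None, None],
})

# Count missing values per column before deciding what to drop
missing_per_column = toy.isna().sum()
print(missing_per_column)

# Drop only the rows that lack a value in the columns we plan to analyse;
# the mostly empty "cabin" column does not cause any rows to be removed
toy_cleaned = toy.dropna(subset=["age", "fare"])
print(len(toy), "rows before,", len(toy_cleaned), "rows after")
```

Restricting `dropna` to a `subset` keeps rows that are only missing values in columns you do not intend to use.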

### 3. How many passengers are in the dataset before and after cleaning?

In [None]:
# Get the number of records
num_records_before = data_frame.shape[0]
num_records_after = data_frame_cleaned.shape[0]

# Print the number of records
print(num_records_before)
print(num_records_after)

1309
1045


### 4. What is the average age of the passengers?

In [None]:
# Calculate the average age
average_age = data_frame['age'].mean()

# Print the average age
print(average_age)


29.8811345124283
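
The other measures introduced earlier (median, mode, variance, and standard deviation) are available as pandas methods as well. The sketch below uses a short made-up series of ages rather than the real `age` column, so the numbers are illustrative only.

```python
import pandas as pd

# Hypothetical ages (NOT from the Titanic file); None marks a missing value
ages = pd.Series([22.0, 38.0, 26.0, 35.0, 35.0, None], name="age")

age_mean = ages.mean()            # missing values are skipped by default
age_median = ages.median()
age_modes = ages.mode().tolist()  # mode() returns a Series because ties are possible
age_variance = ages.var()         # sample variance (ddof=1 by default)
age_std = ages.std()              # sample standard deviation (ddof=1 by default)

print("mean:", age_mean)
print("median:", age_median)
print("mode(s):", age_modes)
print("variance:", age_variance)
print("standard deviation:", age_std)
```

Because pandas skips missing values in these calculations, the result is the same whether they are run on the original or the cleaned data, as long as the cleaning only removed rows with a missing age.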


### 5. How many male and female passengers were on the Titanic?

In [None]:
# Count the number of male and female passengers
gender_counts = data_frame['gender'].value_counts()

# Print the counts
print(gender_counts)


gender
male      843
female    466
Name: count, dtype: int64
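
Counts are often easier to compare as percentages. The sketch below rebuilds the two counts printed above by hand and converts them to shares; on the real column, `value_counts(normalize=True)` would give the same proportions in one step.

```python
import pandas as pd

# Counts copied from the printed output above
gender_counts = pd.Series({"male": 843, "female": 466}, name="count")

# Convert each count to a percentage of the total
gender_share = gender_counts / gender_counts.sum() * 100
print(gender_share.round(2))
```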


6. What is the percentage of passengers who survived?
7. What is the percentage of passengers who did not survive?
8. What is the distribution of passenger classes (1st, 2nd, 3rd)?
9. Add your own analysis question and answer it from the data.
10. Add your own analysis question and answer it from the data.

### 6. What is the percentage of passengers who survived?

In [None]:
# Calculate the number of survivors
num_survivors = data_frame['survived'].sum()

# Calculate the total number of passengers
total_passengers = data_frame.shape[0]

# Calculate the survival percentage
survival_percentage = (num_survivors / total_passengers) * 100

# Print the survival percentage
print("Percentage of passengers who survived: {:.2f}%".format(survival_percentage))


Percentage of passengers who survived: 38.20%
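
Because `survived` is coded as 0 or 1, the mean of the column is the proportion of survivors, which offers a shorter route to the same percentage. The sketch below checks this on a made-up series; the values are hypothetical.

```python
import pandas as pd

# Hypothetical 0/1 outcomes (NOT from the Titanic file)
survived = pd.Series([1, 0, 0, 1, 0], name="survived")

# Long form: number of ones divided by the number of rows
pct_from_sum = survived.sum() / len(survived) * 100

# Short form: the mean of a 0/1 column is the share of ones
pct_from_mean = survived.mean() * 100

print(pct_from_sum, pct_from_mean)
```

Both approaches assume the column has no missing values; `mean()` skips missing entries, whereas dividing by the row count does not.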


### 8. What is the distribution of passenger classes (1st, 2nd, 3rd)?

In [None]:
# Count the occurrences of each passenger class
class_distribution = data_frame['ticket_class'].value_counts()

# Print the class distribution
print(class_distribution)


ticket_class
3    709
1    323
2    277
Name: count, dtype: int64
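
`value_counts()` orders its result by frequency, which is why class 3 appears first. To list the classes in their natural order instead, the result can be sorted by its index. The sketch below rebuilds the counts printed above by hand for illustration.

```python
import pandas as pd

# Counts copied from the printed output above, in the same (frequency) order
class_counts = pd.Series({3: 709, 1: 323, 2: 277}, name="count")

# Reorder by class label (1st, 2nd, 3rd) rather than by frequency
counts_by_class = class_counts.sort_index()
print(counts_by_class)
```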


### 7. What is the percentage of passengers who did not survive?

In [None]:
# Calculate the number of non-survivors
num_non_survivors = total_passengers - num_survivors

# Calculate the non-survival percentage
non_survival_percentage = (num_non_survivors / total_passengers) * 100

# Print the non-survival percentage
print("Percentage of passengers who did not survive: {:.2f}%".format(non_survival_percentage))


Percentage of passengers who did not survive: 61.80%
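
For the open-ended questions 9 and 10, one possible direction is to compare survival rates across groups. The sketch below shows the general pattern with `groupby` on a tiny made-up table; `toy` and its values are hypothetical, so the percentages say nothing about the real dataset.

```python
import pandas as pd

# Hypothetical toy data (NOT the Titanic file), using the same column names
toy = pd.DataFrame({
    "ticket_class": [1, 1, 2, 3, 3, 3],
    "survived":     [1, 1, 0, 0, 1, 0],
})

# Mean of the 0/1 survived column within each class, expressed as a percentage
survival_by_class = toy.groupby("ticket_class")["survived"].mean() * 100
print(survival_by_class.round(1))
```

The same pattern should work with other grouping columns, such as `gender`.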


## Discuss your findings below:


Discuss here...