# PfDA Assignment

## Problem statement:

For this project you must create a data set by simulating a real-world phenomenon of your choosing. You may pick any phenomenon you wish – you might pick one that is of interest to you in your personal or professional life. Then, rather than collect data related to the phenomenon, you should model and synthesise such data using Python. We suggest you use the numpy.random package for this purpose.

Specifically, in this project you should:
- Choose a real-world phenomenon that can be measured and for which you could collect at least one-hundred data points across at least four different variables.
- Investigate the types of variables involved, their likely distributions, and their relationships with each other.
- Synthesise/simulate a data set as closely matching their properties as possible.
- Detail your research and implement the simulation in a Jupyter notebook – the data set itself can simply be displayed in an output cell within the notebook.

Note that this project is about simulation – you must synthesise a data set. Some students may already have some real-world data sets in their own files. It is okay to base your synthesised data set on these should you wish (please reference it if you do), but the main task in this project is to create a synthesised data set. 

## Dataset Chosen

I have chosen to base my synthesised dataset on the passengers on board the RMS Titanic, including whether or not they survived the shipwreck.

The variables that I will include are:

- Sex
- Age
- Passenger Class
- Survived
- Port of Embarkation
- Siblings/Spouse onboard

## Types of Variables

<img src = Images/variable_type_infographic.PNG alt = "Data Types Infographic">

A variable is a characteristic that can be measured and that can assume different values. Height, age, income, province or country of birth, grades obtained at school and type of housing are all examples of variables. Understanding the types of variables in a dataset is crucial for performing appropriate analyses and choosing suitable machine learning algorithms for prediction tasks, because different types of variables require different statistical methods and visualisation techniques. Variables may be classified into two main categories: categorical and numeric. Each category is then divided into two subcategories: nominal or ordinal for categorical variables, and discrete or continuous for numeric variables.

A categorical variable (also called a qualitative variable) refers to a characteristic that cannot be quantified. Categorical variables can be either nominal or ordinal.

  - A nominal variable is one that describes a name, label or category without natural order. Sex and type of dwelling are examples of nominal variables.

  - An ordinal variable is one whose categories have a natural order, although the distance between categories is not defined. Passenger class (first, second, third) is an example of an ordinal variable.

A numeric or quantitative variable is a quantifiable characteristic whose values are numbers. These variables can be either continuous or discrete.

  - A continuous variable is one that can assume an infinite number of real values within a given interval. Continuous variables can be further categorised as either interval or ratio variables, one of the key differences being that a ratio variable has a true zero point.

  - A discrete variable can assume only a finite number of values within a given interval, such as the number of siblings or spouses a passenger has on board.
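
As a rough illustration of how these categories could map onto the six variables chosen above, the sketch below builds a tiny hand-written frame and gives each column a matching pandas dtype. The rows are made up for illustration only and are not taken from the real Titanic data. Treating passenger class as ordinal and the sibling/spouse count as discrete is my own classification assumption.

```python
import pandas as pd

# Toy rows for illustration only; not taken from the real Titanic data.
toy = pd.DataFrame({
    "Sex": ["Male", "Female", "Female"],
    "Age": [22.0, 38.5, 4.0],
    "Class": ["Third", "First", "Second"],
    "Survived": ["No", "Yes", "Yes"],
    "Port of Embarkation": ["Southampton", "Cherbourg", "Queenstown"],
    "Sibling/Spouse": [1, 0, 2],
})

# Nominal: categories with no natural order.
for col in ["Sex", "Survived", "Port of Embarkation"]:
    toy[col] = pd.Categorical(toy[col])

# Ordinal: categories with a meaningful order (Third < Second < First).
toy["Class"] = pd.Categorical(
    toy["Class"], categories=["Third", "Second", "First"], ordered=True
)

# Continuous (Age) stays as float; discrete (Sibling/Spouse) stays as int.
print(toy.dtypes)
print(toy["Class"].min(), "<", toy["Class"].max())
```

Declaring the order explicitly means comparisons and sorting on `Class` follow the class ranking rather than alphabetical order.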

In [None]:
# Import modules necessary for this task
import pandas as pd
import numpy as np

In [None]:
# Read in the existing dataset from the online source.

url = 'https://raw.githubusercontent.com/datasciencedojo/datasets/master/titanic.csv'

df = pd.read_csv(url)

In [None]:
# Display the dataset information
# info() prints its summary directly and returns None, so no print() is needed
df.info()

In [None]:
# Display the first few rows of the original dataset
print(df.head())

In [None]:
# Print a summary of the numerical variables
print(df.describe())
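
One way to investigate the likely distribution of a categorical variable is to measure the share of each category in the real data. The sketch below is a minimal example under stated assumptions: `category_proportions` is a hypothetical helper name of my own, and `toy_port` is a made-up stand-in column rather than real passenger data.

```python
import pandas as pd

def category_proportions(series: pd.Series) -> pd.Series:
    """Return the share of each category, ignoring missing values."""
    return series.dropna().value_counts(normalize=True).sort_index()

# Toy stand-in for a real column; the values are illustrative only.
toy_port = pd.Series(["S", "S", "C", "Q", "S", None, "C", "S"])

props = category_proportions(toy_port)
print(props)
```

With the real data loaded, a call such as `category_proportions(df['Embarked'])` could supply the `p` argument of `np.random.choice`, assuming the port column is named `Embarked` in that file.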

In [18]:
# Set random seed for reproducibility
np.random.seed(42)

# Number of samples in the dataset
num_samples = 200

# Generate synthetic features (each drawn independently, with equal probability per value)
sex = np.random.choice(['Male', 'Female'], size=num_samples)
age = np.random.randint(low=0, high=100, size=num_samples)  # whole years, 0 to 99
passenger_class = np.random.choice(['First', 'Second', 'Third'], size=num_samples)
survived = np.random.choice(['Yes', 'No'], size=num_samples)
port = np.random.choice(['Queenstown', 'Southampton', 'Cherbourg'], size=num_samples)
sibling_spouse = np.random.randint(low=0, high=5, size=num_samples)  # counts 0 to 4

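# Generate synthetic target variable (response)
# Not yet implemented: the draft below assumes a linear relationship with noise,
# but it needs a numeric encoding of sex and size=num_samples before it can run.
# target = 2 * sex + 0.5 * age + np.random.normal(loc=0, scale=0.5, size=100)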

# Create a DataFrame to store the synthetic dataset
synthetic_data = pd.DataFrame({
    'Sex': sex,
    'Age': age,
    'Class': passenger_class,
    'Survived': survived,
    'Port of Embarkation': port,
    'Sibling/Spouse': sibling_spouse,
})


# Save the synthetic dataset to a CSV file
synthetic_data.to_csv('synthetic_dataset.csv', index=False)


      Sex  Age  Class Survived Port of Embarkation  Sibling/Spouse
0    Male   62  First      Yes          Queenstown               4
1  Female   95  Third      Yes         Southampton               0
2    Male   51  Third      Yes          Queenstown               4
3    Male   95  First       No           Cherbourg               0
4    Male    3  First       No          Queenstown               3


In [None]:
# Display the first few rows of the synthetic dataset
print(synthetic_data.head())

In [None]:
# Display the dataset information
# info() prints its summary directly and returns None, so no print() is needed
synthetic_data.info()

In [None]:
# Print a summary of the numerical variables
print(synthetic_data.describe())
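
The simulation above draws every variable independently and with equal probability for each value, so it cannot reproduce relationships between variables. The sketch below shows one possible way to add unequal category weights and a dependency of survival on sex. The probabilities are placeholder assumptions chosen for illustration, not figures measured from the Titanic data, and the sketch uses NumPy's `default_rng` generator rather than the `np.random.seed` approach used in the notebook.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(42)
n = 200

# Placeholder probabilities: assumptions for illustration, not measured values.
p_sex = {"Male": 0.65, "Female": 0.35}
p_survive_given_sex = {"Male": 0.2, "Female": 0.75}

# Weighted draw: categories no longer have equal probability.
sex = rng.choice(list(p_sex), size=n, p=list(p_sex.values()))

# Conditional draw: each row survives with the probability for its sex.
p_row = np.array([p_survive_given_sex[s] for s in sex])
survived = np.where(rng.random(n) < p_row, "Yes", "No")

sim = pd.DataFrame({"Sex": sex, "Survived": survived})

# Survival rate within each sex; rows sum to 1.
print(pd.crosstab(sim["Sex"], sim["Survived"], normalize="index"))
```

Replacing the placeholder dictionaries with proportions estimated from the real dataset would tie the synthetic data more closely to the phenomenon being modelled.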