# Basics of Data

### What is Data?
Data refers to any piece of information that can be collected, stored, and processed. In the context of machine learning, data is the raw material that algorithms use to learn patterns and make predictions. Data can come in various forms, including text, images, audio, video, and more.

## Types of Data:
### Numerical Data:

Numerical data represents quantities and can be measured. It can be further categorized into continuous and discrete data.
* Continuous Data: Continuous data can take any value within a certain range. Examples include height, weight, temperature, and age.
* Discrete Data: Discrete data consists of distinct, separate values. Examples include the number of pets, the count of items, and the number of people.

### Categorical Data:

Categorical data represents qualities or characteristics that have distinct categories. It can be further classified into nominal and ordinal data.

* Nominal Data: Nominal data doesn't have any inherent order. Examples include colors, genders, and types of animals.
* Ordinal Data: Ordinal data has a clear order or ranking between categories. Examples include education levels (e.g., high school, college, graduate) or customer satisfaction ratings (e.g., very satisfied, satisfied, dissatisfied).
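
The four kinds of data described above can be sketched in pandas. This is a minimal illustration, not a prescribed workflow: the DataFrame, its column names, and its values are all made up for the example. It shows how nominal and ordinal columns can be declared as unordered and ordered categories.

```python
import pandas as pd

# Hypothetical toy dataset; names and values are invented for illustration.
df = pd.DataFrame({
    "height_cm": [170.2, 165.5, 180.1],                    # continuous numerical
    "num_pets": [0, 2, 1],                                 # discrete numerical
    "favorite_color": ["red", "blue", "red"],              # nominal categorical
    "education": ["college", "high school", "graduate"],   # ordinal categorical
})

# Nominal data: categories without any order.
df["favorite_color"] = df["favorite_color"].astype("category")

# Ordinal data: categories with an explicit ranking, lowest to highest.
education_levels = ["high school", "college", "graduate"]
df["education"] = pd.Categorical(
    df["education"], categories=education_levels, ordered=True
)

print(df.dtypes)

# Ordered categories support comparisons, e.g. "above high school".
above_high_school = df[df["education"] > "high school"]
print(above_high_school)
```

Declaring the order explicitly matters because pandas would otherwise sort category labels alphabetically, which rarely matches the intended ranking.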

### Data Preprocessing:
Data preprocessing is a critical step in preparing your data for machine learning. Raw data often contains noise, inconsistencies, and missing values, which can lead to inaccurate models. Data preprocessing involves cleaning and transforming the data to make it suitable for analysis.

* Data Cleaning:

Data cleaning involves identifying and rectifying errors or inconsistencies in the dataset.

In [None]:
import pandas as pd

data = pd.read_csv('data/data.csv')

# Remove duplicates
data = data.drop_duplicates()

# Handle outliers (example: removing values above a certain threshold)
data = data[data['age'] < 100]
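
To see what these two steps do without needing `data/data.csv`, here is a small self-contained sketch. The records are invented for illustration, and the threshold of 100 simply mirrors the cell above; a real dataset may call for a different rule.

```python
import pandas as pd

# Made-up records: one exact duplicate row and one implausible age.
df = pd.DataFrame({
    "name": ["Ana", "Ben", "Ben", "Cy"],
    "age": [34, 29, 29, 250],
})

# Drop rows that are exact copies of an earlier row.
deduplicated = df.drop_duplicates()

# Keep only rows whose age is below the chosen threshold.
cleaned = deduplicated[deduplicated["age"] < 100]

print(len(df), len(deduplicated), len(cleaned))  # 4 3 2
```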


* Data Normalization:

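Data normalization (a form of feature scaling) transforms numerical features to a consistent scale, for example by rescaling each value to the range 0 to 1, so that features with large magnitudes do not dominate features with small ones.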

In [None]:
from sklearn.preprocessing import MinMaxScaler

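# Rescale 'age' to the range [0, 1] (the MinMaxScaler default).
# fit_transform expects 2D input, hence the double brackets around 'age'.
scaler = MinMaxScaler()
data['age'] = scaler.fit_transform(data[['age']])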
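
With its default settings, `MinMaxScaler` applies the formula `x_scaled = (x - min) / (max - min)` to each column. The sketch below computes the same thing by hand on a few made-up ages, which may help make the transformation less of a black box.

```python
import pandas as pd

# Made-up ages, used only to illustrate the arithmetic.
ages = pd.Series([20, 30, 40, 60], name="age")

# Min-max scaling: subtract the minimum, then divide by the range.
age_min, age_max = ages.min(), ages.max()
ages_scaled = (ages - age_min) / (age_max - age_min)

print(ages_scaled.tolist())  # [0.0, 0.25, 0.5, 1.0]
```

The smallest value maps to 0, the largest to 1, and everything else falls proportionally in between.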

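* Handling Missing Values:

Missing values are gaps in the data where no value is recorded for a certain variable. They can be handled either by removing the affected rows or by filling the gaps with a substitute value such as the mean.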

In [None]:
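# Option 1: fill missing values in 'income' with the column's mean
# (assigning the result back avoids the unreliable inplace=True on a column selection)
data['income'] = data['income'].fillna(data['income'].mean())

# Option 2: drop any rows that still contain missing values
data = data.dropna()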
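
Before choosing a strategy, it usually helps to count how many values are missing in each column. The following is a minimal sketch on an invented DataFrame; it compares dropping rows with filling gaps, using the median here as one possible alternative to the mean.

```python
import numpy as np
import pandas as pd

# Made-up data with one missing value per column.
df = pd.DataFrame({
    "age": [25, np.nan, 40, 31],
    "income": [50000.0, 62000.0, np.nan, 58000.0],
})

# Count missing values per column.
missing_per_column = df.isna().sum()
print(missing_per_column)

# Strategy A: drop every row that has at least one missing value.
dropped = df.dropna()

# Strategy B: fill each gap with the median of its own column.
filled = df.fillna(df.median(numeric_only=True))

print(len(dropped))  # 2
print(filled)
```

Dropping rows discards half of this tiny dataset, while filling keeps all four rows at the cost of introducing estimated values.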

# Exploratory Data Analysis (EDA):

### Understanding Your Data Through Summary Statistics:
Summary statistics provide a concise overview of key properties of your data. These statistics help you understand the distribution, central tendency, variability, and potential anomalies in your dataset.

Common summary statistics include:

* Mean: The average value of a variable.
* Median: The middle value of a variable when the data is sorted.
* Standard Deviation: A measure of how spread out the values are around the mean.
* Minimum and Maximum: The smallest and largest values of a variable.
* Percentiles: Values below which a given percentage of the data falls. For example, the 25th percentile (Q1) is the value below which 25% of the data falls.

In [None]:
# Calculate summary statistics

mean_age = data['age'].mean()
median_income = data['income'].median()
std_dev_height = data['height'].std()

print(f"Mean Age: {mean_age}")
print(f"Median Income: {median_income}")
print(f"Standard Deviation of Height: {std_dev_height}")
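
Most of the statistics listed above can also be obtained in one call with `describe()`, and individual percentiles with `quantile()`. The sketch below uses a small made-up series so it runs without the CSV file.

```python
import pandas as pd

# Made-up ages, used only for illustration.
ages = pd.Series([22, 25, 29, 35, 41], name="age")

# describe() reports count, mean, std, min, 25%, 50%, 75%, and max.
summary = ages.describe()
print(summary)

# quantile() returns a single percentile; 0.25 is the 25th percentile (Q1).
q1 = ages.quantile(0.25)
print(q1)  # 25.0
```

In the `describe()` output, the `50%` row is the median.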

### Visualization in EDA:
Visualizations help you gain a deeper understanding of your data by providing a graphical representation of its distribution, relationships, and trends. Here are three important types of visualizations often used in EDA:

- Histogram:
A histogram is a graphical representation of the distribution of numerical data. It divides the data into bins and displays the frequency or count of data points in each bin.
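
The binning step behind a histogram can be sketched with NumPy alone. The ages, the number of bins, and the range below are all assumptions chosen for illustration; the text bars stand in for what a plotting library would draw.

```python
import numpy as np

# Made-up ages, used only for illustration.
ages = np.array([21, 23, 25, 31, 34, 38, 42, 47, 55, 58])

# Split the range 20-60 into 4 equal-width bins and count values per bin.
counts, bin_edges = np.histogram(ages, bins=4, range=(20, 60))

# Print one text bar per bin; bar length equals the count in that bin.
for count, left, right in zip(counts, bin_edges[:-1], bin_edges[1:]):
    print(f"{left:.0f}-{right:.0f} | {'#' * int(count)}")
```

Changing the number of bins changes how coarse or detailed the picture of the distribution looks, so it is worth trying a few values.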