# Introduction to Basic Terminologies & Statistics for Data Science

---

## Quick Summary

Data Science is the process of turning raw data into useful knowledge and decisions.  
To do this, we:

1. Collect and clean data  
2. Explore and analyze it  
3. Build models to represent patterns  
4. Use statistics to describe data and make conclusions  

Key ideas in this topic:
- Data vs Dataset
- Variables (qualitative and quantitative)
- Knowledge representation (especially models)
- Population vs Sample
- Descriptive vs Inferential statistics
- Mean, Median, Mode
- Range, Variance, Standard Deviation

---

# 1. What is Data Science?

**Data Science** is a multidisciplinary field that extracts useful insights from data.

It combines:
- Statistics
- Programming
- Machine Learning
- Domain knowledge

A simplified pipeline:

Raw Data → Clean & Prepare → Analyze → Model → Insight → Decision

Data science is not just about algorithms. It includes:
- Cleaning messy data
- Understanding the business problem
- Communicating results clearly

---

# 2. Data Science vs Related Fields

## Data Science vs Data Analytics

- **Data analysts** analyze existing data to answer questions.
- **Data scientists** may create new models, methods, or tools.

Analysts focus more on reporting and interpretation.  
Data scientists focus more on modeling and prediction.

---

## Data Science vs Data Engineering

- **Data engineers** build systems to collect, store, and move data.
- They create pipelines and manage infrastructure.
- **Data scientists** use this prepared data to build models.

Think of it like:
- Engineers build the roads.
- Scientists drive on them to reach insights.

---

## Data Science vs Machine Learning

Machine Learning (ML) is a **tool** inside Data Science.

- ML trains systems to learn patterns from data.
- Data Science may use ML, but also includes statistics, data cleaning, and business understanding.

Not all Data Science problems require Machine Learning.

---

## Data Science vs Statistics

Statistics:
- A mathematical field focused on analyzing numerical data.
- Concerned with probability, hypothesis testing, uncertainty.

Data Science:
- Broader field.
- Uses statistics + computing + modeling.

All data scientists use statistics.  
Not all statisticians work in data science.

---


# 3. What is Data?

**Data** is anything that is recorded.

Examples:
- Online purchases
- Sensor readings
- Bank transactions
- Social media clicks
- Medical records

---


# 4. Dataset, Records, and Attributes

A **dataset** is a structured collection of data.

It consists of:

- **Rows (Records)** → Each row represents one object/person/event.
- **Columns (Attributes/Variables)** → Each column represents one property.

Example structure:

| Person | Age | Income | Marital Status |
|--------|-----|--------|----------------|
| A      | 25  | 30K    | Single         |

Here:
- A row = one person
- A column = one attribute
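
As a minimal sketch of the row/column idea, the table above can be held as a list of records in plain Python. Person A comes from the table; person B and the field names are hypothetical, added only so the column has more than one value.

```python
# Each dict is one record (row); each key is one attribute (column).
dataset = [
    {"person": "A", "age": 25, "income": 30_000, "marital_status": "Single"},
    # Hypothetical second record, not part of the original table.
    {"person": "B", "age": 41, "income": 52_000, "marital_status": "Married"},
]

# One record = one row.
first_record = dataset[0]

# One attribute = one column, collected across all records.
ages = [record["age"] for record in dataset]

print(first_record["marital_status"])
print(ages)
```

Reading across a dict gives everything known about one person; reading one key down the list gives one attribute for everyone.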

---

# 5. Variables (Attributes)

A **variable** is a single measurable property of an object.

Examples:
- Age
- Temperature
- Cylinder count
- Heart rate
- Marital status

---

# 6. Types of Variables

Variables are divided into two main types:

## A. Qualitative (Categorical)

These represent categories or labels.

### 1. Nominal
- No natural order.
- Example: Color, Country, Animal type.
- You cannot perform meaningful arithmetic.

### 2. Ordinal
- Has order, but differences are not measurable.
- Example: Small < Medium < Large
- Order matters, but distance does not.

### 3. Binary
- Only two possible values.
- Example: Yes/No, 0/1, True/False.
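
The three categorical types support different operations. The sketch below uses made-up values to show one sensible operation for each: counting for nominal data, rank-based sorting for ordinal data, and tallying for binary data.

```python
from collections import Counter

# Nominal: no order, so counting categories is meaningful but averaging is not.
colors = ["red", "blue", "red"]
color_counts = Counter(colors)

# Ordinal: categories have a rank. The numbers below encode position only;
# the gap between ranks carries no meaning.
size_rank = {"Small": 0, "Medium": 1, "Large": 2}
shirts = ["Large", "Small", "Medium"]
sorted_shirts = sorted(shirts, key=size_rank.get)

# Binary: exactly two values, here stored as booleans.
answers = [True, False, True]
yes_count = sum(answers)

print(color_counts)
print(sorted_shirts)
print(yes_count)
```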

---

## B. Quantitative (Numeric)

These represent measurable numerical values.

### 1. Discrete
- Countable numbers.
- Example: Number of children, number of emails.

### 2. Continuous
- Measured values that can take any value within a range, including fractions.
- Example: Height, weight, temperature.

---

## Interval vs Ratio Scale

This distinction is important:

### Interval Scale
- Differences are meaningful.
- Zero does NOT mean absence.
- Example: Celsius temperature (0°C is not “no temperature”).

### Ratio Scale
- Differences AND ratios are meaningful.
- Zero means absence.
- Example: Weight (0 kg means no weight).
- 10 kg is twice 5 kg.
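
A short numeric check may make the difference concrete. The Kelvin conversion is an addition for illustration (Kelvin has a true zero, so it behaves as a ratio scale); the specific temperatures are arbitrary.

```python
# Interval scale: Celsius. Differences are meaningful, ratios are not.
c_low, c_high = 10.0, 20.0
celsius_difference = c_high - c_low   # 10 degrees warmer: meaningful
celsius_ratio = c_high / c_low        # 2.0, yet 20°C is not "twice as hot"

# The same temperatures in Kelvin, where zero means absence of thermal energy.
k_low, k_high = c_low + 273.15, c_high + 273.15
kelvin_ratio = k_high / k_low         # roughly 1.035, the physically meaningful ratio

# Ratio scale: weight in kg. Zero means no weight, so the ratio is meaningful.
weight_ratio = 10.0 / 5.0             # 2.0: 10 kg is twice 5 kg

print(celsius_difference, celsius_ratio, round(kelvin_ratio, 3), weight_ratio)
```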

---


# 7. From Data to Knowledge

Knowledge is the set of meaningful patterns we discover in data.

Example:

| Animal | Body Mass | Heart Rate |
|--------|-----------|------------|
| Mouse  | Small     | Fast       |
| Whale  | Large     | Slow       |

Hidden knowledge:
- Smaller animals tend to have faster heartbeats.

This pattern is more useful than raw numbers.

---

# 8. Knowledge Representation

We can represent knowledge in three main ways:

## 1. Proposition (Statement)
"Smaller animals have faster heartbeats."

## 2. Narrative
A descriptive explanation of the pattern.

## 3. Model (Most Important in Data Science)
A mathematical or computational formula that describes the relationship.

Example model:
r = 235 × m^(-1/4)

Where:
- r = heart rate
- m = body mass

Models allow prediction.
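
To illustrate what "prediction" means here, the example model can be written as a small function. The notes do not state units, so the function treats `m` and `r` as plain numbers; the function name and the masses in the loop are hypothetical.

```python
def predicted_heart_rate(mass):
    """Evaluate the example model r = 235 * m^(-1/4)."""
    if mass <= 0:
        raise ValueError("body mass must be positive")
    return 235 * mass ** (-1 / 4)


# Hypothetical body masses, from small to large.
for mass in (0.02, 70, 100_000):
    print(mass, round(predicted_heart_rate(mass), 1))
```

The predicted rate falls as mass grows, which is the same pattern the proposition states in words, but the model also gives a number for masses that were never observed.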

---

# 9. Statistical Analysis

Statistical analysis means interpreting data to find patterns or trends.

Usually, we do not analyze the entire population.  
We analyze a **sample**.

---

# 10. Population and Sample

## Population
The entire group we are interested in.

Example:
- All students in a university.

## Sample
A subset of the population used for analysis.

Example:
- 200 students selected from the university.

A good sample:
- Represents the population fairly.
- Contains relevant information.

Bad sampling leads to wrong conclusions.
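
The effect of good versus bad sampling can be sketched with a simulated population. Everything below is hypothetical: the student ages are randomly generated, and "only the youngest 200" stands in for any unrepresentative selection.

```python
import random
import statistics

random.seed(42)  # fixed seed so the sketch is repeatable

# Hypothetical population: ages of 10,000 university students.
population = [random.randint(17, 30) for _ in range(10_000)]

# Simple random sample of 200 students, drawn without replacement.
sample = random.sample(population, k=200)

# A biased sample: only the 200 youngest students.
biased_sample = sorted(population)[:200]

population_mean = statistics.mean(population)
sample_mean = statistics.mean(sample)
biased_mean = statistics.mean(biased_sample)

print(round(population_mean, 2), round(sample_mean, 2), round(biased_mean, 2))
```

The random sample's mean should land close to the population mean, while the biased sample's mean is far too low.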

---

# 11. Types of Statistics

## A. Descriptive Statistics
Describes what the data looks like.

Answers:
- What is typical?
- How spread out is it?

Includes:
- Mean
- Median
- Mode
- Range
- Variance
- Standard Deviation

---

## B. Inferential Statistics
Makes conclusions about the population using a sample.

Answers:
- Is this claim likely true?
- Is the difference significant?
- What is the probability?

Used for:
- Hypothesis testing
- Estimation
- Decision-making under uncertainty

---

# 12. Measures of Central Tendency

These describe the "center" of the data.

---

## Mean (Average)

Mean = (Sum of values) / (Number of values)

- Sensitive to outliers.
- Good when data is symmetric.

---

## Median

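The middle value after sorting. With an even number of values, it is the average of the two middle values.

- More robust than the mean: a few extreme values barely move it.
- Better when data has outliers or is skewed.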
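
A small comparison shows how differently the mean and median react to one outlier. The income figures are made up for illustration.

```python
import statistics

# Hypothetical incomes (in thousands).
incomes = [28, 30, 31, 33, 35]
incomes_with_outlier = incomes + [400]   # one extreme value added

mean_before = statistics.mean(incomes)                   # 31.4
mean_after = statistics.mean(incomes_with_outlier)       # about 92.8
median_before = statistics.median(incomes)               # 31
median_after = statistics.median(incomes_with_outlier)   # 32.0 (average of 31 and 33)

print(mean_before, round(mean_after, 1))
print(median_before, median_after)
```

The single outlier roughly triples the mean but moves the median by only one unit.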

---

## Mode

The most frequent value.

- Useful for categorical data.
- Can have more than one mode.
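
The mode works on categorical values directly. As a sketch with made-up data, `statistics.multimode` (Python 3.8+) returns every value tied for most frequent:

```python
import statistics

# Hypothetical categorical data with a tie for most frequent value.
colors = ["red", "blue", "red", "green", "blue"]
modes = statistics.multimode(colors)   # ['red', 'blue']

# With a single most frequent value, statistics.mode returns it directly.
statuses = ["Single", "Married", "Single"]
most_common_status = statistics.mode(statuses)   # 'Single'

print(modes)
print(most_common_status)
```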

---

# 13. Measures of Spread

These describe how dispersed the data is.

---

## Range

Range = Maximum − Minimum

- Very simple.
- Sensitive to extreme values.
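
A quick sketch with arbitrary numbers shows why the range is sensitive to extremes: one added value changes it completely.

```python
values = [4, 8, 15, 16, 23, 42]
value_range = max(values) - min(values)                  # 42 - 4 = 38

# Add a single extreme value and recompute.
with_extreme = values + [1000]
extreme_range = max(with_extreme) - min(with_extreme)    # 1000 - 4 = 996

print(value_range, extreme_range)
```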

---

## Variance

Measures average squared distance from the mean.

Idea:
1. Find mean.
2. Subtract mean from each value.
3. Square the differences.
4. Average them (divide by n for a whole population; divide by n − 1 when estimating from a sample).

Higher variance → more spread.
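
The four steps can be followed line by line in code. The values are arbitrary, chosen so the arithmetic is easy to check by hand; the final lines compare the result with the `statistics` module, which offers both the population version (divide by n) and the sample version (divide by n − 1).

```python
import statistics

values = [2, 4, 4, 4, 5, 5, 7, 9]

mean = sum(values) / len(values)                    # step 1: 5.0
deviations = [x - mean for x in values]             # step 2: subtract the mean
squared = [d ** 2 for d in deviations]              # step 3: square the differences
population_variance = sum(squared) / len(values)    # step 4: 32 / 8 = 4.0

# Sample variance divides by n - 1 instead of n.
sample_variance = sum(squared) / (len(values) - 1)  # 32 / 7, about 4.57

print(population_variance, statistics.pvariance(values))
print(round(sample_variance, 2), round(statistics.variance(values), 2))
```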

---

## Standard Deviation

Standard Deviation = √(Variance)

- Same unit as data.
- Easier to interpret than variance.
- Small SD → data clustered near mean.
- Large SD → data widely spread.
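
As a brief sketch, the standard deviation is the square root of the variance, and two made-up datasets with the same mean can have very different SDs.

```python
import math
import statistics

values = [2, 4, 4, 4, 5, 5, 7, 9]
population_sd = math.sqrt(statistics.pvariance(values))   # sqrt(4) = 2.0

# Two hypothetical datasets, both with mean 50.
clustered = [49, 50, 50, 51]
spread = [10, 30, 70, 90]

clustered_sd = statistics.pstdev(clustered)   # small: values sit near the mean
spread_sd = statistics.pstdev(spread)         # large: values lie far from the mean

print(population_sd, round(clustered_sd, 2), round(spread_sd, 2))
```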

---

# 14. Inferential Statistics Example

Suppose we want to test:

"Cars with 6 cylinders have horsepower greater than 95."

We:
1. Take a sample.
2. Calculate statistics.
3. Use hypothesis testing.
4. Decide whether the evidence supports the claim.

This is inferential statistics.
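
One possible way to carry out these steps is a one-sided, one-sample t-test. The notes do not name a specific test, so this choice is an assumption, as is reading the claim as a statement about the *mean* horsepower. The ten readings are invented for illustration, and the critical value 1.833 is the one-sided 5% value for 9 degrees of freedom from a standard t-table.

```python
import math
import statistics

# Step 1: a hypothetical sample of ten 6-cylinder cars.
horsepower = [105, 110, 98, 100, 95, 108, 112, 97, 103, 102]
claimed_value = 95

# Step 2: calculate sample statistics.
n = len(horsepower)
sample_mean = statistics.mean(horsepower)
sample_sd = statistics.stdev(horsepower)   # sample SD (divides by n - 1)

# Step 3: t statistic = how many standard errors the sample mean lies above 95.
standard_error = sample_sd / math.sqrt(n)
t_statistic = (sample_mean - claimed_value) / standard_error

# Step 4: compare with the one-sided critical value (alpha = 0.05, df = 9).
critical_value = 1.833
evidence_supports_claim = t_statistic > critical_value

print(sample_mean, round(t_statistic, 2), evidence_supports_claim)
```

With this invented sample the t statistic exceeds the critical value, so the data would support the claim at the 5% level. Real data could just as easily fail to.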

---

# Final Conceptual Flow

Data → Dataset → Variables → Patterns → Models → Statistics → Decisions

Descriptive statistics help us understand data.  
Inferential statistics help us generalize from a sample to the wider population.

---