1. What is statistics, and why is it important?

Statistics is the science of collecting, organizing, analyzing, interpreting, and presenting data. It helps us make informed decisions based on data patterns and trends. It’s essential in fields like healthcare, economics, business, social science, and more.

Importance:

Enables data-driven decision-making.

Helps summarize large datasets meaningfully.

Supports research and policy development.

Aids in predicting future outcomes (e.g., market trends, weather forecasting).

2. What are the two main types of statistics?

The two main types are:

Descriptive Statistics – Summarizes and describes features of a dataset.

Inferential Statistics – Makes predictions or generalizations about a population based on a sample.

3. What are descriptive statistics?

Descriptive statistics are methods for summarizing data using:

Measures of central tendency (mean, median, mode)

Measures of dispersion (range, variance, standard deviation)

Graphs and tables (bar charts, histograms)

Example: Reporting the average test score of a class.

4. What is inferential statistics?

Inferential statistics allows conclusions to be drawn about a larger population from a sample. It uses:

Hypothesis testing

Confidence intervals

Regression analysis

Example: Predicting election outcomes based on a voter sample.
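As a rough illustration of one of the tools listed above, the sketch below builds an approximate 95% confidence interval for a population mean from a sample. The scores are made-up values, and the interval uses a normal approximation; for a sample this small, a t-distribution would usually be the more careful choice.

```python
import math
import statistics

# Hypothetical sample of test scores drawn from a larger population
sample = [72, 85, 78, 90, 66, 81, 77, 88, 74, 79]

n = len(sample)
sample_mean = statistics.mean(sample)
sample_sd = statistics.stdev(sample)        # sample standard deviation (n - 1)
standard_error = sample_sd / math.sqrt(n)   # estimated spread of the sample mean

# Critical value for 95% confidence under a normal approximation (about 1.96)
z = statistics.NormalDist().inv_cdf(0.975)

ci_low = sample_mean - z * standard_error
ci_high = sample_mean + z * standard_error

print(f"Sample mean: {sample_mean:.2f}")
print(f"Approximate 95% CI: ({ci_low:.2f}, {ci_high:.2f})")
```

The interval is a statement about the population mean, not about individual scores: it widens with more variable data and narrows as the sample size grows.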



5. What is sampling in statistics?

Sampling is the process of selecting a subset of individuals from a population to estimate characteristics of the whole population. It saves time and resources while still providing meaningful insights.

6. What are the different types of sampling methods?

Sampling methods include:

Random Sampling Methods:

Simple random sampling: Every individual has an equal chance.

Stratified sampling: Population divided into subgroups; samples taken from each.

Systematic sampling: Every kth item is selected.

Cluster sampling: Entire groups (clusters) are selected randomly.

Non-Random Sampling Methods:

Convenience sampling: Based on ease of access.

Judgmental sampling: Based on the researcher’s judgment.

Snowball sampling: Existing subjects help recruit more.
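Cluster sampling is the least obvious of the random methods, so here is a minimal sketch using only the standard library. The classrooms and student names are hypothetical, and the seed is fixed only to make the run repeatable.

```python
import random

random.seed(0)  # fixed seed so the sketch is repeatable

# Hypothetical population: 5 classrooms (clusters) of 4 students each
clusters = {
    f"class_{c}": [f"student_{c}_{s}" for s in range(1, 5)]
    for c in range(1, 6)
}

# Cluster sampling: randomly pick whole clusters, then keep everyone in them
chosen = random.sample(sorted(clusters), k=2)
cluster_sample = [student for name in chosen for student in clusters[name]]

print("Chosen clusters:", chosen)
print("Sample size:", len(cluster_sample))
```

Unlike stratified sampling, which draws some members from every subgroup, this draws every member from only some groups.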

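7. What is the difference between random and non-random sampling?

Random Sampling: Each member has a known, non-zero chance of being selected (an equal chance in simple random sampling). Leaving selection to chance reduces selection bias.

Non-Random Sampling: Selection is subjective or based on convenience, so selection probabilities are unknown. Can introduce bias.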



8. Define and give examples of qualitative and quantitative data.

Qualitative Data (Categorical): Describes qualities or categories.

Examples: Colors, gender, marital status.

Quantitative Data (Numerical): Represents measurable quantities.

Examples: Height, age, test scores.

9. What are the different types of data in statistics?

Data types include:

Qualitative (Categorical): Nominal, Ordinal

Quantitative (Numerical): Discrete, Continuous

10. Explain nominal, ordinal, interval, and ratio levels of measurement.

Nominal: Categories with no order. (e.g., blood type)

Ordinal: Ordered categories. (e.g., satisfaction ratings)

Interval: Ordered, equal spacing, no true zero. (e.g., temperature in °C)

Ratio: Like interval, but has a true zero. (e.g., weight, height)
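The practical difference between interval and ratio scales is whether ratios are meaningful. The sketch below illustrates this with two made-up temperatures: on the Celsius scale 20 looks like "twice" 10, but converting to Kelvin (a scale with a true zero) shows the physical ratio is much smaller.

```python
# Interval scale: Celsius has an arbitrary zero, so ratios are not meaningful
c_low, c_high = 10.0, 20.0
celsius_ratio = c_high / c_low          # 2.0, but physically misleading

# Ratio scale: Kelvin has a true zero, so its ratio is the meaningful one
k_low, k_high = c_low + 273.15, c_high + 273.15
kelvin_ratio = k_high / k_low           # roughly 1.035

# Differences are meaningful on both scales and agree with each other
celsius_diff = c_high - c_low
kelvin_diff = k_high - k_low

print(f"Celsius ratio: {celsius_ratio:.3f}, Kelvin ratio: {kelvin_ratio:.3f}")
print(f"Celsius difference: {celsius_diff:.1f}, Kelvin difference: {kelvin_diff:.1f}")
```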

11. What is the measure of central tendency?

Measures of central tendency are values that represent the center or typical value of a dataset. They summarize the data with a single number.

12. Define mean, median, and mode.

Mean: Arithmetic average.

Median: Middle value when data is ordered.

Mode: Most frequent value in the dataset.
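The three measures respond differently to extreme values. The following sketch, using made-up ages, shows that a single outlier pulls the mean sharply while the median and mode stay put.

```python
import statistics

ages = [30, 32, 35, 35, 40]        # hypothetical data
ages_with_outlier = ages + [400]   # same data plus one data-entry error

for label, values in [("original", ages), ("with outlier", ages_with_outlier)]:
    print(
        f"{label}: mean={statistics.mean(values):.1f}, "
        f"median={statistics.median(values)}, "
        f"mode={statistics.mode(values)}"
    )
```

This is one reason the median is often preferred for skewed data such as incomes or house prices.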

13. What is the significance of the measure of central tendency?

It provides a quick summary of a dataset and helps compare different datasets. Useful for understanding the general trend or behavior of data.

14. What is variance, and how is it calculated?

Variance measures the average squared deviation from the mean.

Formula:

$$\sigma^2 = \frac{\sum_{i=1}^{N} (x_i - \mu)^2}{N}$$

Where:

$x_i$ = each value

$\mu$ = mean

$N$ = number of values
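To make the formula concrete, the sketch below applies it step by step to a small dataset and checks the result against the standard library. Note the assumption built into the formula: dividing by N gives the population variance, whereas `statistics.variance` divides by N - 1 (the sample variance), so the two differ.

```python
import math
import statistics

data = [10, 12, 23, 23, 16, 23, 21, 16]

N = len(data)
mu = sum(data) / N                                # mean
squared_deviations = [(x - mu) ** 2 for x in data]
population_variance = sum(squared_deviations) / N  # the formula above
population_std = math.sqrt(population_variance)

print("Mean:", mu)
print("Population variance (manual):", population_variance)
print("Population variance (statistics.pvariance):", statistics.pvariance(data))
print("Sample variance (statistics.variance):", statistics.variance(data))
print(f"Population standard deviation: {population_std:.3f}")
```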

15. What is standard deviation, and why is it important?

Standard deviation is the square root of the variance. It tells how spread out the values are around the mean.

Importance:

Easier to interpret than variance.

Helps assess data consistency.

16. Define and explain the term range in statistics.

Range = Maximum value – Minimum value

It’s the simplest measure of spread: the distance between the largest and smallest values. Because it depends only on the two extremes, it is sensitive to outliers.

17. What is the difference between variance and standard deviation?

Variance: Average squared deviation from the mean; expressed in squared units.

Standard Deviation: Square root of variance; in original units, making it easier to interpret.

18. What is skewness in a dataset?

Skewness measures the asymmetry of the distribution.

Positive skew: Tail on the right.

Negative skew: Tail on the left.

Zero skew: Symmetrical distribution.

19. What does it mean if a dataset is positively or negatively skewed?

Positively Skewed: Most values are on the lower end; few large values.

Negatively Skewed: Most values are high; few small values.
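A small numerical sketch of the same idea: the helper below computes the moment-based (population) skewness, one of several common definitions, on made-up data. A long right tail gives a positive value and pulls the mean above the median; the mirror image does the opposite.

```python
import statistics

def skewness(values):
    """Population skewness: third central moment over the std dev cubed."""
    n = len(values)
    mean = sum(values) / n
    m2 = sum((v - mean) ** 2 for v in values) / n   # second central moment
    m3 = sum((v - mean) ** 3 for v in values) / n   # third central moment
    return m3 / m2 ** 1.5

right_skewed = [2, 3, 3, 4, 4, 4, 5, 30]    # one large value stretches the right tail
left_skewed = [-v for v in right_skewed]    # mirror image: tail on the left

for name, values in [("right-skewed", right_skewed), ("left-skewed", left_skewed)]:
    print(
        f"{name}: skewness={skewness(values):.2f}, "
        f"mean={statistics.mean(values)}, median={statistics.median(values)}"
    )
```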

20. Define and explain kurtosis.

Kurtosis measures the "tailedness" of a distribution:

Leptokurtic: Heavy tails (more outliers).

Mesokurtic: Tails similar to a normal distribution (excess kurtosis near 0).

Platykurtic: Light tails (fewer outliers).
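The three categories are usually told apart by excess kurtosis (kurtosis minus 3, so a normal distribution scores 0). The sketch below computes the moment-based population version on two made-up datasets; the helper name and data are illustrative only.

```python
def excess_kurtosis(values):
    """Population excess kurtosis: fourth central moment / variance^2, minus 3."""
    n = len(values)
    mean = sum(values) / n
    m2 = sum((v - mean) ** 2 for v in values) / n   # variance
    m4 = sum((v - mean) ** 4 for v in values) / n   # fourth central moment
    return m4 / m2 ** 2 - 3

heavy_tails = [5, 5, 5, 5, 5, 5, 5, -20, 30]   # mostly identical, two extreme values
light_tails = [1, 2, 3, 4, 5, 6, 7, 8, 9]      # evenly spread, no extremes

print(f"Heavy tails (leptokurtic): {excess_kurtosis(heavy_tails):.2f}")
print(f"Light tails (platykurtic): {excess_kurtosis(light_tails):.2f}")
```

A positive result indicates heavier tails than a normal distribution, a negative result lighter tails.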

21. What is the purpose of covariance?

Covariance indicates the direction of a linear relationship between two variables:

Positive: variables move in the same direction.

Negative: variables move in opposite directions.

22. What does correlation measure in statistics?

Correlation measures both the strength and direction of a linear relationship between two variables, typically using the Pearson correlation coefficient (r):

Ranges from -1 to +1.

0 means no linear correlation (a non-linear relationship may still exist).

23. What is the difference between covariance and correlation?

Covariance: Indicates direction only; values are unbounded.

Correlation: Standardized version of covariance; bounded between -1 and 1.
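The sketch below makes the "standardized" point concrete with made-up study data. Changing the units of one variable (hours to minutes) rescales the covariance but leaves the correlation untouched. The helper functions are written out by hand for illustration and use the sample (n - 1) definitions.

```python
import math

def sample_covariance(x, y):
    """Sample covariance with an n - 1 denominator."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    return sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y)) / (n - 1)

def pearson_r(x, y):
    """Correlation = covariance divided by the product of standard deviations."""
    return sample_covariance(x, y) / math.sqrt(
        sample_covariance(x, x) * sample_covariance(y, y)
    )

hours = [1, 2, 3, 4, 5]             # hypothetical study time in hours
scores = [2, 4, 5, 4, 5]            # hypothetical quiz scores
minutes = [h * 60 for h in hours]   # same study time, different unit

print("Covariance (hours):  ", sample_covariance(hours, scores))
print("Covariance (minutes):", sample_covariance(minutes, scores))
print(f"Correlation (hours):   {pearson_r(hours, scores):.4f}")
print(f"Correlation (minutes): {pearson_r(minutes, scores):.4f}")
```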

24. What are some real-world applications of statistics?

Healthcare: Drug testing, disease prediction.

Business: Customer behavior analysis, quality control.

Education: Evaluating student performance.

Government: Census, economic planning.

Sports: Player performance, game strategy.

**PRACTICAL QUESTIONS**

1. How do you calculate the mean, median, and mode of a dataset?

In [None]:
import statistics

data = [12, 15, 15, 17, 19, 20, 20, 20, 25]

mean = statistics.mean(data)
median = statistics.median(data)
mode = statistics.mode(data)

print(f"Mean: {mean}, Median: {median}, Mode: {mode}")


2. Write a Python program to compute the variance and standard deviation of a dataset

In [None]:
import statistics

data = [10, 12, 23, 23, 16, 23, 21, 16]

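# statistics.variance/stdev use the sample formulas (divide by n - 1);
# statistics.pvariance/pstdev give the population versions (divide by N)
variance = statistics.variance(data)
std_dev = statistics.stdev(data)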

print(f"Variance: {variance}, Standard Deviation: {std_dev}")


3. Create a dataset and classify it into nominal, ordinal, interval, and ratio


In [None]:
dataset = {
    "Nominal": ["Red", "Blue", "Green"],
    "Ordinal": ["Low", "Medium", "High"],
    "Interval": [10, 20, 30],  # Temperature in Celsius
    "Ratio": [5, 10, 15]       # Weight in kg
}
print(dataset)


4. Implement sampling techniques like random sampling and stratified sampling

In [None]:
import pandas as pd

# Sample dataset
df = pd.DataFrame({
    "Gender": ["Male", "Female"] * 5,
    "Score": [45, 48, 52, 55, 60, 65, 68, 70, 75, 80]
})

# Random sampling: 4 rows chosen uniformly at random
random_sample = df.sample(n=4)

# Stratified sampling: draw 50% of the rows from each gender group
stratified_sample = df.groupby('Gender').sample(frac=0.5)

print("Random Sample:\n", random_sample)
print("\nStratified Sample:\n", stratified_sample)


5. Write a Python function to calculate the range of a dataset

In [None]:
def calculate_range(data):
    return max(data) - min(data)

data = [3, 7, 2, 9, 5]
print("Range:", calculate_range(data))


6. Create a dataset and plot its histogram to visualize skewness

In [None]:
import matplotlib.pyplot as plt
import numpy as np

data = np.random.exponential(scale=2, size=1000)

plt.hist(data, bins=30)
plt.title("Histogram to Show Skewness")
plt.xlabel("Values")
plt.ylabel("Frequency")
plt.show()


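7. Calculate skewness and kurtosis of a dataset using Python libraries

In [None]:
import numpy as np
from scipy.stats import skew, kurtosis

data = np.random.normal(loc=0, scale=1, size=1000)

# Both values should be close to 0 for normally distributed data:
# scipy's kurtosis() returns excess kurtosis (Fisher definition) by default
print("Skewness:", skew(data))
print("Kurtosis:", kurtosis(data))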


8. Generate a dataset and demonstrate positive and negative skewness

In [None]:
positive_skew = np.random.exponential(scale=2, size=1000)
negative_skew = -1 * np.random.exponential(scale=2, size=1000)

plt.hist(positive_skew, bins=30)
plt.title("Positive Skew")
plt.show()

plt.hist(negative_skew, bins=30)
plt.title("Negative Skew")
plt.show()


9. Write a Python script to calculate covariance between two datasets

In [None]:
import numpy as np

x = [1, 2, 3, 4, 5]
y = [2, 4, 6, 8, 10]

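# np.cov returns the 2x2 covariance matrix (sample covariance by default);
# the off-diagonal entry is cov(x, y)
cov_matrix = np.cov(x, y)
print("Covariance:", cov_matrix[0, 1])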


10. Write a Python script to calculate the correlation coefficient between two datasets

In [None]:
# Reuses x and y from the covariance cell above
correlation = np.corrcoef(x, y)
print("Correlation Coefficient:", correlation[0, 1])


11. Create a scatter plot to visualize the relationship between two variables

In [None]:
plt.scatter(x, y)
plt.title("Scatter Plot")
plt.xlabel("X")
plt.ylabel("Y")
plt.show()


12. Implement and compare simple random sampling and systematic sampling
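
A minimal sketch of both techniques, assuming a hypothetical population of 100 numbered IDs and a target sample size of 10. The seed is fixed only so the run is repeatable.

```python
import random
import statistics

random.seed(42)  # fixed seed so the sketch is repeatable

population = list(range(1, 101))   # hypothetical population: IDs 1..100
n = 10                             # desired sample size

# Simple random sampling: every subset of size n is equally likely
simple_random = sorted(random.sample(population, n))

# Systematic sampling: random start within the first k items, then every k-th item
k = len(population) // n
start = random.randrange(k)
systematic = population[start::k]

print("Simple random:", simple_random)
print("Systematic:   ", systematic)
print("Population mean:   ", statistics.mean(population))
print("Simple random mean:", statistics.mean(simple_random))
print("Systematic mean:   ", statistics.mean(systematic))
```

Systematic sampling spreads the sample evenly across the list, which tends to give stable estimates when the list has no periodic pattern; if the ordering repeats with a period related to k, it can be badly biased, and simple random sampling is the safer choice.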

13. Calculate the mean, median, and mode of grouped data

In [None]:
import pandas as pd

# Class intervals and frequencies
data = {
    "Class": ["0-10", "10-20", "20-30", "30-40", "40-50"],
    "Frequency": [5, 8, 15, 16, 6]
}

df = pd.DataFrame(data)
df["Midpoint"] = [5, 15, 25, 35, 45]
df["Fx"] = df["Midpoint"] * df["Frequency"]

mean_grouped = df["Fx"].sum() / df["Frequency"].sum()
print("Grouped Mean:", mean_grouped)


Grouped Mean: 27.0
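
The cell above covers only the grouped mean. The grouped median and mode can be estimated with the usual textbook interpolation formulas: median = L + ((N/2 - cf) / f) * h, and mode = L + ((f1 - f0) / (2*f1 - f0 - f2)) * h, where L is the lower bound of the median or modal class, h the class width, cf the cumulative frequency before the median class, and f0, f1, f2 the frequencies of the classes before, at, and after the modal class. The sketch below applies them to the same table, with plain lists standing in for the DataFrame columns.

```python
# Same class intervals and frequencies as the grouped-mean example
lower_bounds = [0, 10, 20, 30, 40]
frequencies = [5, 8, 15, 16, 6]
width = 10
total = sum(frequencies)

# Median class: first class whose cumulative frequency reaches N / 2
cumulative = 0
for i, f in enumerate(frequencies):
    if cumulative + f >= total / 2:
        median_idx = i
        break
    cumulative += f
# `cumulative` now holds the total frequency before the median class
grouped_median = (
    lower_bounds[median_idx]
    + (total / 2 - cumulative) / frequencies[median_idx] * width
)

# Modal class: class with the highest frequency (neighbours default to 0 at the edges)
modal_idx = frequencies.index(max(frequencies))
f1 = frequencies[modal_idx]
f0 = frequencies[modal_idx - 1] if modal_idx > 0 else 0
f2 = frequencies[modal_idx + 1] if modal_idx < len(frequencies) - 1 else 0
grouped_mode = lower_bounds[modal_idx] + (f1 - f0) / (2 * f1 - f0 - f2) * width

print(f"Grouped Median: {grouped_median:.2f}")
print(f"Grouped Mode: {grouped_mode:.2f}")
```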


14. Simulate data using Python and calculate its central tendency and dispersion.

In [None]:
import numpy as np
import statistics

data = np.random.normal(loc=50, scale=10, size=100)

mean = statistics.mean(data)
median = statistics.median(data)
mode = statistics.mode(data.round())  # round for discrete mode
variance = statistics.variance(data)
std_dev = statistics.stdev(data)

print(f"Mean: {mean}, Median: {median}, Mode: {mode}, Variance: {variance}, Std Dev: {std_dev}")


15. Use NumPy or pandas to summarize a dataset’s descriptive statistics

In [None]:
import numpy as np
import pandas as pd
import seaborn as sns
import matplotlib.pyplot as plt
from scipy.stats import zscore, skew, kurtosis, pearsonr, spearmanr

# Generate example dataset (seeded so the results are reproducible)
np.random.seed(42)
data = pd.DataFrame({
    'A': np.random.normal(50, 10, 100),
    'B': np.random.normal(60, 15, 100)
})

# count, mean, std, min, quartiles, and max for each column
print(data.describe())


16. Plot a boxplot to understand the spread and identify outliers

In [None]:
# Boxplot: points beyond the whiskers are flagged as potential outliers
sns.boxplot(data=data)
plt.title("Boxplot of A and B")
plt.show()


17. Calculate the interquartile range (IQR) of a dataset

In [None]:
# IQR = Q3 - Q1, computed per column
Q1 = data.quantile(0.25)
Q3 = data.quantile(0.75)
IQR = Q3 - Q1
print(f"IQR:\n{IQR}")


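18. Implement Z-score normalization and explain its significance

In [None]:
# Z-score: (value - column mean) / column standard deviation
z_normalized = data.apply(zscore)
print("Z-score normalized data:\n", z_normalized.head())

# Significance: rescales each column to mean 0 and standard deviation 1, so
# variables on different scales become comparable; large |z| values flag potential outliers.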
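
For intuition, the same transformation can be written out by hand. The sketch below uses made-up values and the population standard deviation, which matches the default of `scipy.stats.zscore` (`ddof=0`); the result has mean 0 and standard deviation 1.

```python
import statistics

values = [50, 60, 70, 80, 90]        # hypothetical measurements

mean = statistics.mean(values)
std = statistics.pstdev(values)      # population standard deviation (ddof=0)

# z = (x - mean) / std: distance from the mean in units of standard deviation
z_scores = [(v - mean) / std for v in values]

print("Z-scores:", [round(z, 3) for z in z_scores])
print("Mean of z-scores:", round(statistics.mean(z_scores), 10))
print("Std of z-scores: ", round(statistics.pstdev(z_scores), 10))
```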


19. Compare two datasets using their standard deviation

In [None]:
std_A = np.std(data['A'], ddof=1)
std_B = np.std(data['B'], ddof=1)
print(f"Standard Deviation of A: {std_A:.2f}, B: {std_B:.2f}")


20. Write a Python program to visualize covariance using a heatmap

In [None]:
cov_matrix = data.cov()
sns.heatmap(cov_matrix, annot=True, cmap='coolwarm')
plt.title("Covariance Heatmap")
plt.show()


21. Use seaborn to create a correlation matrix for a dataset

In [None]:
corr_matrix = data.corr()
sns.heatmap(corr_matrix, annot=True, cmap='viridis')
plt.title("Correlation Matrix")
plt.show()


22. Generate a dataset and implement both variance and standard deviation computations

In [None]:
var = data.var()
std = data.std()
print(f"Variance:\n{var}\n\nStandard Deviation:\n{std}")


23. Visualize skewness and kurtosis using Python libraries like matplotlib or seaborn.

In [None]:
for col in data.columns:
    sns.histplot(data[col], kde=True)
    plt.title(f"{col} - Skew: {skew(data[col]):.2f}, Kurtosis: {kurtosis(data[col]):.2f}")
    plt.xlabel(col)
    plt.show()


24. Implement the Pearson and Spearman correlation coefficients for a dataset.

In [None]:
pearson_corr, _ = pearsonr(data['A'], data['B'])
spearman_corr, _ = spearmanr(data['A'], data['B'])
print(f"Pearson Correlation: {pearson_corr:.2f}")
print(f"Spearman Correlation: {spearman_corr:.2f}")
