## EDA (Exploratory Data Analysis)

As the term “univariate” suggests, this session deals with analysing variables one at a time. It is important to separately understand each variable before moving on to analysing multiple variables together.

The broad agenda for this session is as follows:

1. Metadata description
2. Data distribution plots
3. Summary metrics

### Types of Variables

In our study of variables, we have encountered various types that play a crucial role in statistical analysis. Let's recap and categorize them for a clearer understanding.

1. Categorical Variables:
    - Ordered Categorical Variables: These variables exhibit a distinct order or hierarchy among their categories. For instance:
      - Salary: High > Medium > Low
      - Month: Jan < Feb < Mar, and so on.
    - Unordered Categorical Variables: These variables lack a specific order or hierarchy among their categories. Examples include:
      - Type of loan: Home, Personal, Auto
      - Organization department: Sales, Marketing, HR
2. Quantitative/Numeric Variables:
   These variables are numeric in nature and can undergo mathematical operations such as addition, multiplication, and division. Notable examples include:
    - Salary
    - Number of bank accounts
    - Runs scored by a batsman
    - Mileage of a car

Understanding the nature of these variables is fundamental to conducting meaningful statistical analyses. Categorical variables, in particular, can be further classified based on the presence or absence of a specific order among their categories. In contrast, quantitative variables provide us with numerical data that allows for more extensive mathematical manipulation.

As we delve deeper into statistical analysis, the distinction between these variable types will be crucial in choosing appropriate analytical techniques and drawing accurate conclusions from our data.

### Univariate Analysis of Variables

Now that we have established a foundational understanding of the types of variables in our dataset, the next step is to extract meaningful insights through univariate analysis. Univariate analysis involves examining individual variables in isolation to grasp their characteristics and patterns.

In the upcoming lectures, we will focus on conducting univariate analysis on the following types of variables:

- Categorical Variables:

    - Unordered Categorical Variables: Our exploration will encompass variables without a specific order among categories. Examples include types of loans (e.g., home, personal, auto) and organizational departments (e.g., sales, marketing, HR).

    - Ordered Categorical Variables: For variables with inherent ordering, such as salary levels (e.g., high, medium, low) and months (e.g., Jan, Feb, Mar), we will delve into techniques to extract insights from their ordered nature.

- Quantitative Variables:

    - Univariate analysis of quantitative variables involves exploring numeric data. We will apply various statistical measures and visualizations to understand the characteristics of variables like salary, the number of bank accounts, runs scored by a batsman, and the mileage of a car.

By dissecting each variable type separately, we can uncover valuable information about their distributions, central tendencies, and variability. This univariate analysis serves as a crucial preliminary step before diving into more complex multivariate analyses, allowing us to comprehend the individual nuances of each variable in our dataset.

### Stock Market dataset 
- Metadata

#### When to use linear scale and when to use log scale 


The decision to use a linear scale or a logarithmic (log) scale depends on the nature of the data and the goals of the analysis. Here are some considerations for when to use each type of scale:

Linear Scale:

- Data Distribution is Even: If your data is evenly distributed across a wide range and there are no extreme values, a linear scale may be appropriate. Linear scales are commonly used when the data spans a relatively small range and there are no orders of magnitude differences between data points.

- Interpretability: Linear scales are more intuitive for most people to interpret. The distance between two points on a linear scale represents the same absolute difference across the entire range.

- Relationships are Additive: Linear scales are suitable when the relationships between variables are additive. For example, if you're plotting the population growth over time, a linear scale may be appropriate if the growth is consistently additive.

Logarithmic Scale:

- Wide Range of Values: When your data spans several orders of magnitude, a logarithmic scale can compress the visual representation of the data, making it easier to discern patterns and variations.

- Skewed Data: In cases where the data is skewed or there are extreme values (outliers), a logarithmic scale can help to visualize the entire range of data without the extreme values dominating the visualization.

- Multiplicative Relationships: Logarithmic scales are useful when the relationships between variables are multiplicative rather than additive. For instance, if you're looking at exponential growth, a logarithmic scale can linearize the relationship.

- Percentage Changes: Log scales are often used when percentage changes are more meaningful than absolute changes, as equal percentage changes are represented by equal distances on the scale.

In summary, the choice between linear and logarithmic scales depends on the characteristics of your data and the specific goals of your analysis. It's essential to consider the scale that best represents the relationships within your data and facilitates clear interpretation and visualization.
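
To make the multiplicative case concrete, here is a minimal sketch using only Python's standard library. The series is made up for illustration (a value that doubles at every step) and is not taken from the stock market dataset. It shows that the step sizes keep growing on a linear scale but stay constant on a log scale.

```python
import math

# Hypothetical series that doubles at every step (multiplicative growth).
values = [100 * 2 ** step for step in range(8)]

# On a linear scale, the gap between successive points keeps growing.
linear_steps = [b - a for a, b in zip(values, values[1:])]

# On a log10 scale, the gap is constant, because equal ratios
# (here, a factor of 2) map to equal distances.
log_values = [math.log10(v) for v in values]
log_steps = [b - a for a, b in zip(log_values, log_values[1:])]

print("linear steps:", linear_steps)
print("log10 steps: ", [round(step, 4) for step in log_steps])
```

If the log-scale steps come out roughly equal for your own data, that is a hint that the variable grows multiplicatively and that a log axis is worth trying.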

#### Analysis of Unordered Categorical Variables


- Rank-Frequency Plots: The lecture underscores the significance of rank-frequency plots as a valuable technique for extracting meaning from unordered categorical variables.
- Non-Trivial Analysis: The shift in perspective highlights the possibility of conducting non-trivial analysis on seemingly straightforward variables.
- Plot Utilization: Plots are acknowledged as effective aids in the analysis process, demonstrating their capacity to reveal insights even in the absence of inherent order among categories.

The objective is not to overly focus on specific statistical concepts like power laws or rank-frequency plots, but rather to instill the understanding that plots play a pivotal role in unraveling the intricacies of unordered categorical variables. This newfound awareness opens avenues for more nuanced and insightful analyses in our exploration of categorical data.
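
As a rough illustration of how the numbers behind a rank-frequency plot could be built, the sketch below counts each category and ranks the categories by frequency. The loan-type values are invented for the example; only the technique matters.

```python
from collections import Counter

# Hypothetical unordered categorical variable (type of loan).
loans = ["Home", "Auto", "Home", "Personal", "Home",
         "Auto", "Home", "Personal", "Home", "Auto"]

# Count each category and sort from most to least frequent.
counts = Counter(loans).most_common()

# Rank 1 is the most frequent category.
rank_frequency = [
    (rank, category, freq)
    for rank, (category, freq) in enumerate(counts, start=1)
]

for rank, category, freq in rank_frequency:
    print(rank, category, freq)
```

Plotting rank against frequency (often with both axes on a log scale) gives the rank-frequency plot.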

# Quantitative Variables - Univariate Analysis

Univariate analysis of quantitative variables involves examining and understanding the characteristics of individual numeric variables within a dataset. Here's how you can conduct univariate analysis for quantitative variables. 

Central Tendency:
- Mean: Calculate the average of the quantitative variable. It gives you a measure of central tendency.
  Mean = Sum of all values / Number of values
- Median: Find the middle value of the dataset when it is ordered. It is less sensitive to extreme values than the mean.
- Mode: Identify the value that occurs most frequently in the dataset.

Dispersion or Spread:

- Range: The difference between the maximum and minimum values in the dataset.
  Range = Max Value − Min Value
- Variance: Measure of how far each data point in the set is from the mean.
- Standard Deviation: Square root of the variance. It provides a more interpretable measure of the spread.
- Interquartile Range (IQR): Range covered by the middle 50% of the dataset, calculated as the difference between the third quartile (Q3) and the first quartile (Q1).
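
The measures above can be computed with Python's built-in `statistics` module. This is a minimal sketch on made-up data (the runs are hypothetical). It assumes the data is the whole population, so it uses `pvariance` and `pstdev`; for a sample, `variance` and `stdev` would be the usual choice.

```python
import statistics

# Hypothetical runs scored by a batsman in nine innings.
runs = [12, 45, 45, 30, 8, 100, 45, 27, 30]

# Central tendency
mean_runs = statistics.mean(runs)
median_runs = statistics.median(runs)
mode_runs = statistics.mode(runs)

# Dispersion or spread
value_range = max(runs) - min(runs)
variance = statistics.pvariance(runs)   # population variance
std_dev = statistics.pstdev(runs)       # population standard deviation
q1, _, q3 = statistics.quantiles(runs, n=4, method="inclusive")
iqr = q3 - q1

print(mean_runs, median_runs, mode_runs)
print(value_range, round(variance, 2), round(std_dev, 2), iqr)
```

Note how the single score of 100 pulls the mean (38) above the median (30), which is why the median is described as less sensitive to extreme values.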

### What is the use of univariate analysis?


Univariate analysis serves several crucial purposes in the field of statistics and data analysis. Here are some key uses of univariate analysis:

- Descriptive Statistics:
  - Univariate analysis provides descriptive statistics that summarize and describe the main features of a single variable. Measures such as mean, median, mode, range, and standard deviation offer insights into the central tendency and variability of the data.
- Data Exploration:
  - It helps in exploring the characteristics of individual variables, providing a foundational understanding of the data before moving on to more complex analyses.
- Data Cleaning:
  - Univariate analysis can highlight potential data issues, such as outliers or missing values, which can then be addressed in the data cleaning process.
- Visualization:
  - Visualization techniques, like histograms, box plots, and probability density functions, allow for a visual representation of the data distribution. This aids in identifying patterns, trends, and anomalies.
- Identifying Outliers:
  - Univariate analysis helps in detecting outliers, which are data points that significantly deviate from the overall pattern of the dataset. Identifying outliers is crucial for understanding data integrity and potential errors.
- Comparisons and Benchmarks:
  - Univariate analysis enables comparisons between different groups or subpopulations within a dataset. It serves as a starting point for understanding the characteristics of various segments.
- Variable Transformation:
  - Understanding the distribution of variables can guide decisions on variable transformations. For instance, if a variable is not normally distributed, transformation techniques may be applied to meet the assumptions of certain statistical methods.
- Statistical Inference:
  - Univariate analysis lays the groundwork for more advanced statistical analyses. The insights gained from examining a single variable can inform the choice of appropriate statistical tests and models for more complex multivariate analyses.
- Decision-Making:
  - In various fields, univariate analysis helps in making informed decisions by providing a clear picture of the key features of the data.
- Communication:
  - Results from univariate analysis are often easier to communicate to a broader audience, making it an essential step in conveying findings to stakeholders who may not have a strong statistical background.

In essence, univariate analysis is a fundamental step in the data analysis process, offering insights into individual variables that form the basis for more advanced and nuanced analyses. It aids in understanding data patterns, assessing data quality, and making informed decisions based on a comprehensive exploration of the data.

### Outlier Detection
- Demo in Excel
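
The notes only mention an Excel demo, so the following is an assumption about one common way to flag outliers, not a record of that demo. It applies the widely used 1.5 × IQR rule to made-up numbers. The factor 1.5 is a convention, and the data is hypothetical.

```python
import statistics

# Hypothetical values with one suspicious entry (102).
data = [10, 12, 12, 13, 12, 11, 14, 13, 15, 102]

q1, _, q3 = statistics.quantiles(data, n=4, method="inclusive")
iqr = q3 - q1

# Fences placed 1.5 * IQR beyond the quartiles (a convention, not a law).
lower_fence = q1 - 1.5 * iqr
upper_fence = q3 + 1.5 * iqr

outliers = [x for x in data if x < lower_fence or x > upper_fence]
print("Q1:", q1, "Q3:", q3, "IQR:", iqr)
print("outliers:", outliers)
```

A flagged value is a candidate for inspection, not automatically an error; whether to drop, cap, or keep it depends on the data.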

#### When to include and exclude the median when calculating the interquartile range

The Interquartile Range (IQR) is the difference between the third quartile (Q3) and the first quartile (Q1), so it measures the spread of the middle 50% of the data. A common way to find Q1 and Q3 is to split the sorted dataset at the median and take the median of each half. The question of including or excluding the median arises only when the dataset has an odd number of values, because the median is then an actual data point sitting between the two halves. With an even number of values, the data splits cleanly into two halves and both conventions give the same quartiles.

Here's the breakdown:

Include the median in the IQR calculation:

- The median is counted as a member of both the lower half and the upper half.
- Q1 (First Quartile): The median of the lower half of the dataset, representing the value below which roughly 25% of the data falls.
- Q3 (Third Quartile): The median of the upper half of the dataset, representing the value below which roughly 75% of the data falls.
- IQR: Calculated as Q3 − Q1.

Exclude the median in some cases:

- In some statistical conventions and software tools, the median is left out of both halves before Q1 and Q3 are computed. This gives an IQR that is the same as or slightly wider than the one from the inclusive method.
- Under either convention, the IQR is robust to extreme values, because it ignores the lowest 25% and the highest 25% of the data. This makes it a reliable measure of spread in the presence of outliers or skewed distributions.

In summary, these notes include the median when calculating the IQR. Whichever convention you choose, apply it consistently and check which one your tool uses before comparing results.
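
The difference between the two conventions is easiest to see on a small dataset with an odd number of values. The helper below is a hypothetical illustration of the "median of each half" approach (the function name and data are made up); statistical packages may use other interpolation rules and give slightly different quartiles.

```python
from statistics import median

def quartiles(values, include_median):
    """Return (Q1, Q3) as the medians of the lower and upper halves."""
    data = sorted(values)
    n = len(data)
    half = n // 2
    if n % 2 == 0:
        # Even count: the data splits cleanly, so the flag has no effect.
        lower, upper = data[:half], data[half:]
    elif include_median:
        # Odd count: the middle value belongs to both halves.
        lower, upper = data[:half + 1], data[half:]
    else:
        # Odd count: the middle value is dropped from both halves.
        lower, upper = data[:half], data[half + 1:]
    return median(lower), median(upper)

data = [1, 3, 5, 7, 9, 11, 13]  # seven values, median is 7

q1_inc, q3_inc = quartiles(data, include_median=True)
q1_exc, q3_exc = quartiles(data, include_median=False)

print("inclusive:", q1_inc, q3_inc, "IQR =", q3_inc - q1_inc)
print("exclusive:", q1_exc, q3_exc, "IQR =", q3_exc - q1_exc)
```

On this data the inclusive method gives an IQR of 6 and the exclusive method gives 8, which is why it matters to state the convention being used.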

# Segmented Univariate analysis

Agenda
- Basics of segmentation
- Comparison of averages
- Comparison of other metrics

E.g., the batting average of all players in the world

Let's look into the NAS dataset
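
As a small sketch of what segmentation means in practice, the code below splits records by a segment (country) and compares the average and another metric (the median) across segments. The records are invented for illustration; they are not real player statistics and are not from the NAS dataset.

```python
from collections import defaultdict
from statistics import mean, median

# Hypothetical (country, batting average) records.
records = [
    ("India", 48.2), ("India", 35.1), ("India", 52.7),
    ("Australia", 44.0), ("Australia", 39.5),
    ("England", 37.3), ("England", 41.9),
]

# Step 1: segment the values by country.
segments = defaultdict(list)
for country, average in records:
    segments[country].append(average)

# Step 2: compute the same univariate metrics for every segment.
summary = {
    country: {
        "count": len(values),
        "mean": round(mean(values), 2),
        "median": median(values),
    }
    for country, values in segments.items()
}

for country, metrics in summary.items():
    print(country, metrics)
```

Comparing more than one metric per segment helps, since a segment with few records or one extreme value can have a misleading mean.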

# Bivariate analysis 

Bivariate analysis is a statistical method used to examine the relationship between two variables. It focuses on understanding how changes in one variable are associated with changes in another variable. This analysis can help identify patterns, trends, and correlations between the two variables.

In brief, the process of bivariate analysis typically involves:

- Visualization: Plotting the data points on a graph, such as a scatter plot or a line graph, to visually inspect the relationship between the two variables.

- Correlation Analysis: Calculating correlation coefficients (such as Pearson correlation coefficient, Spearman's rank correlation coefficient) to quantify the strength and direction of the relationship between the two variables. A correlation coefficient close to +1 indicates a strong positive relationship, close to -1 indicates a strong negative relationship, and close to 0 indicates no linear relationship.

- Regression Analysis: Using regression techniques to model the relationship between the variables, typically with one variable as the dependent variable and the other as the independent variable. This helps in predicting the value of one variable based on the value of the other variable.

- Hypothesis Testing: Testing hypotheses about the relationship between the variables, such as whether the correlation coefficient is significantly different from zero, or whether the regression coefficients are significantly different from zero.

Overall, bivariate analysis provides insights into the association between two variables, helping researchers and analysts to understand and interpret their data better.
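
To show what the correlation step computes, here is a minimal hand-written Pearson correlation coefficient in plain Python. The function name and the two small data series are assumptions made for this sketch; in practice a library routine would normally be used.

```python
import math

def pearson_r(xs, ys):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    # Sum of products of deviations (proportional to the covariance).
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    # Sums of squared deviations (proportional to the variances).
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / math.sqrt(var_x * var_y)

# Hypothetical data: hours studied and exam score.
hours = [1, 2, 3, 4, 5]
scores = [52, 55, 61, 64, 70]

r = pearson_r(hours, scores)
print(round(r, 3))
```

A value of `r` close to +1 here reflects that the made-up scores rise almost linearly with hours; it says nothing about causation.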

#### Types of Sampling Methods

Random Sampling: In this method, people are just selected randomly. This is similar to pulling names out of a hat.

 

Example: Suppose you want to find out the average internet usage per person in India. You just put the names of all the Indians in a hat, pull, say, 100 names out, and then calculate the average of these 100 Indians.

 

Stratified Sampling: Here, people are divided into subgroups and then selected randomly. But this is done in such a way that the final sample has the same proportions of these subgroups as does the population.

 

Example: Again, suppose you want to find out the average internet usage per person in India. Note that 70% of Indians live in rural areas, and 30% live in urban areas. So, you would put the names of all the rural Indians in hat A and the names of all the urban Indians in hat B. Then, you’d pull 70 names out of hat A and 30 names out of hat B. Now, again you’d have a sample of 100 Indians, but this time, your sample would be more representative of the population as its rural and urban proportions would be the same as that of the population.
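
The difference between random and stratified sampling can be sketched in a few lines of Python. The population below is synthetic (10,000 labelled placeholders with the 70/30 rural/urban split from the example), and the fixed seed is only there to make the run repeatable.

```python
import random

random.seed(42)  # fixed seed so the sketch is repeatable

# Synthetic population: 70% rural, 30% urban.
population = ([("rural", i) for i in range(7000)]
              + [("urban", i) for i in range(3000)])

# Random sampling: every person is equally likely to be picked,
# so the rural share of the sample varies from draw to draw.
simple_sample = random.sample(population, 100)

# Stratified sampling: draw from each subgroup in proportion to its size.
rural = [p for p in population if p[0] == "rural"]   # "hat A"
urban = [p for p in population if p[0] == "urban"]   # "hat B"
stratified_sample = random.sample(rural, 70) + random.sample(urban, 30)

def rural_share(sample):
    """Fraction of the sample that lives in a rural area."""
    return sum(1 for area, _ in sample if area == "rural") / len(sample)

print("random sample rural share:    ", rural_share(simple_sample))
print("stratified sample rural share:", rural_share(stratified_sample))
```

The stratified sample matches the population split exactly by construction, while the random sample only matches it approximately.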

 

Volunteer Sampling: Here, people who want to volunteer for any given survey form your sample.

 

Example: Suppose that once more, you want to find out the average internet usage per person in India. You could have everyone answer an online survey, which asks them how often/much they use the internet. You could ask the same question through a telephonic survey.

 

As you can see through this example, the good thing about this type of sampling is that it looks unbiased and random because the person who will take part in the survey is selected at random through the medium (internet, telephone) itself. There is no human interference. However, the medium will also bring in some bias. For example, an internet survey is more likely to have people with a high internet usage, whereas a telephone survey is a little more likely to have a balanced representation of heavy internet users and people who use the internet infrequently.

 

Opportunity Sampling: In this method, the people around the surveyor form the sample.

 

Example: This time, when you want to find out the average internet usage per person in India, you just ask 100 people around you about their internet usage.

 

Clearly, this sampling method has the potential to become extremely biased. The only good thing here, probably, is that this is a relatively convenient sampling method.

So, there are four typical cases in which sampling is generally used:

Market research: Suppose your company wants to launch a product that depends on people having a decent internet connection, such as Hotstar, Netflix, etc. For such a product, you need to first understand what the potential market size is. For this, you need to conduct a survey with some people and based on their data, infer parameters such as the average data usage, the willingness to adopt new technologies, etc. for the entire population.

Marketing campaign efficacy: Suppose you work for a company such as Hotstar, Netflix, etc. You want more and more people to move from your competitors’ platforms to your platform. You are planning to do this through a marketing campaign. But how should this marketing campaign be structured? How much should its budget be? What should the strategy used (free membership for a week/lower membership fees for a few weeks/etc.) be? You can use your past marketing campaigns’ data and your knowledge of sampling techniques to make these decisions.

Pilot testing: Again, let’s go to the Hotstar and Netflix example. Suppose you’ve done all the market research required, and you’ve developed the product. Now, before putting your product out there, you might want to give it a trial run. For this, you can perform what is called a pilot test. What this means is that instead of giving your product a full-fledged launch, you can just launch it partially for a few people. These people can test your product and help you decide whether it is good enough for the full launch.

Quality control: This is more of a manufacturing-centred application. Let’s say your company produces 10 million smartphones every year. This means that around 30,000 phones are produced every day. In such a situation, QA (quality assurance) becomes a function of utmost importance. Since it is difficult to check 30,000 phones every day, your company would just “sample” a few and then make decisions based on those samples.

# Hypothesis testing

The statistical analyses learnt in Inferential Statistics enable you to make inferences about the population mean from the sample data when you have no idea of the population mean. However, sometimes you have some starting assumption about the population mean and you want to confirm that assumption using the sample data. It is here that hypothesis testing comes into the picture. We will cover the basic concepts of hypothesis testing in this session, which are as follows:

- Types of hypotheses
- Types of tests
- Decision criteria
- Critical value method of hypothesis testing

Let’s understand the basic difference between inferential statistics and hypothesis testing.

 

Inferential statistics is used to find some population parameter (mostly population mean) when you have no initial number to start with. So, you start with the sampling activity and find out the sample mean. Then, you estimate the population mean from the sample mean using the confidence interval.

 

Hypothesis testing is used to confirm your conclusion (or hypothesis) about the population parameter (which you know from EDA or your intuition). Through hypothesis testing, you can determine whether there is enough evidence to conclude if the hypothesis about the population parameter is true or not.

 

Both these modules have a few similar concepts, so don’t confuse terminology used in hypothesis testing with inferential statistics.

Hypothesis Testing starts with the formulation of these two hypotheses:

Null hypothesis (H₀): The status quo

Alternate hypothesis (H₁): The challenge to the status quo

### Making a decision

You can tell the type of the test and the position of the critical region on the basis of the ‘sign’ in the alternate hypothesis.


       ≠ in H₁    →   Two-tailed test        →     Rejection region on both sides of distribution

       < in H₁    →   Lower-tailed test     →     Rejection region on left side of distribution

       > in H₁    →   Upper-tailed test     →     Rejection region on right side of distribution
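
The mapping above can be turned into numbers once a test distribution is chosen. The sketch below assumes a z-test (standard normal distribution) and a significance level of 0.05; both are assumptions for illustration, and the function name is hypothetical. The string `"!="` stands for the ≠ sign.

```python
from statistics import NormalDist

def critical_values(alternative_sign, alpha=0.05):
    """Boundaries of the rejection region for a z-test.

    alternative_sign is the sign used in H1: "!=", "<" or ">".
    """
    z = NormalDist()  # standard normal: mean 0, standard deviation 1
    if alternative_sign == "!=":
        # Two-tailed: alpha is split equally between both tails.
        bound = z.inv_cdf(1 - alpha / 2)
        return (-bound, bound)
    if alternative_sign == "<":
        # Lower-tailed: the whole of alpha sits in the left tail.
        return (z.inv_cdf(alpha),)
    if alternative_sign == ">":
        # Upper-tailed: the whole of alpha sits in the right tail.
        return (z.inv_cdf(1 - alpha),)
    raise ValueError("alternative_sign must be '!=', '<' or '>'")

for sign in ("!=", "<", ">"):
    print(sign, [round(v, 3) for v in critical_values(sign)])
```

A test statistic falling beyond these boundaries lands in the rejection region, so the null hypothesis is rejected at that significance level.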

# Critical Value Method

# Hypothesis Testing II 

There are various methods similar to the critical value method to statistically make your decision about the hypothesis. In this session, you will study one such method, which is called the p-value method. This is an important method and is used more frequently in the industry.

The broad agenda for this session is as follows:

- The p-value method of hypothesis testing
- Types of errors in hypothesis testing


The p-value method is a statistical technique used in hypothesis testing to determine the significance of results. It's particularly prevalent in frequentist statistical analysis.

Here's a basic outline of how the p-value method works:

Formulate Hypotheses: Begin by formulating a null hypothesis (H0) and an alternative hypothesis (H1). The null hypothesis typically states that there is no effect or no difference, while the alternative hypothesis contradicts the null hypothesis by asserting that there is some effect or difference.

Select a Test Statistic: Choose an appropriate test statistic that measures the difference between the observed data and what is expected under the null hypothesis.

Determine the Distribution of the Test Statistic Under H0: Assuming the null hypothesis is true, determine the distribution of the test statistic. This distribution is often known or can be approximated.

Calculate the p-value: The p-value is the probability of observing a test statistic at least as extreme as the one actually observed, assuming that the null hypothesis is true. It quantifies the strength of evidence against the null hypothesis.

If the p-value is small (usually below a pre-defined threshold, such as 0.05), it suggests that the observed data is unlikely to have occurred if the null hypothesis were true, leading to the rejection of the null hypothesis in favor of the alternative hypothesis.

If the p-value is large, it suggests that the observed data is reasonably likely to occur under the null hypothesis, so there is not enough evidence to reject the null hypothesis.

Make a Decision: Based on the p-value and the pre-defined significance level (alpha), decide whether to reject the null hypothesis. If the p-value is less than alpha, reject the null hypothesis; otherwise, fail to reject it.

It's important to note that the p-value method does not provide information about the size or importance of an effect. It only assesses the strength of evidence against the null hypothesis.

Additionally, while p-values are commonly used, they have been subject to criticism and misuse in statistical practice. It's essential to interpret them correctly and consider other factors, such as effect size and study design, when drawing conclusions from statistical analyses.

Scenario: Suppose a pharmaceutical company has developed a new drug intended to lower blood pressure. They want to test whether the drug is effective, so they conduct a clinical trial comparing blood pressure measurements before and after administering the drug to a sample of patients.

Hypotheses:

Null Hypothesis (H0): The drug has no effect on blood pressure; the mean blood pressure is the same before and after administering the drug.

Alternative Hypothesis (H1): The drug lowers blood pressure; the mean blood pressure after administering the drug is lower than before.

Sample: The company selects a random sample of 30 patients and measures their blood pressure before administering the drug. After a month of treatment, they measure their blood pressure again.

Test Statistic: In this case, a common test statistic used to compare two means is the t-statistic, which is calculated using the difference in means between the before and after measurements, adjusted for sample size and variability.

Procedure:

1. The company collects blood pressure data before and after administering the drug to the 30 patients in the sample.
2. They calculate the mean difference in blood pressure before and after administering the drug.
3. They also calculate the standard deviation of the differences to estimate the variability.
4. Using these values, they calculate the t-statistic.
5. They determine the distribution of the t-statistic under the null hypothesis (H0), which is a t-distribution with n − 1 degrees of freedom, where n is the sample size.
6. Finally, they calculate the p-value, which is the probability of observing a t-statistic at least as extreme as the one calculated, assuming the null hypothesis is true.

Example: Let's say the calculated t-statistic is 2.5, and the degrees of freedom are 29. From the t-distribution table or statistical software, the company finds that the probability of observing a t-statistic of 2.5 or larger under the null hypothesis is approximately 0.009 (about 0.018 if both tails are counted).

Decision:

If the significance level (alpha) is set at 0.05 (commonly used), the p-value is less than 0.05 under either convention, so the company rejects the null hypothesis.

They conclude that there is sufficient evidence to suggest that the drug lowers blood pressure in the population from which the sample was drawn.

This is a simplified example, but it illustrates the basic steps of using the p-value method in hypothesis testing. Real-world applications may involve more complex statistical techniques and considerations.
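
The steps of this example can be sketched in plain Python. Everything here is an illustrative assumption: the blood pressure readings are made up (10 patients rather than 30), the helper names are hypothetical, and the p-value is obtained by numerically integrating the t-distribution density with Simpson's rule because the standard library has no t-distribution. A statistics package would normally be used instead.

```python
import math
from statistics import mean, stdev

def t_pdf(x, df):
    """Density of Student's t-distribution with df degrees of freedom."""
    coef = math.gamma((df + 1) / 2) / (math.sqrt(df * math.pi) * math.gamma(df / 2))
    return coef * (1 + x * x / df) ** (-(df + 1) / 2)

def upper_tail_p(t_stat, df, steps=2000):
    """P(T >= t_stat), using Simpson's rule on [0, |t_stat|] (steps is even)."""
    b = abs(t_stat)
    h = b / steps
    total = t_pdf(0.0, df) + t_pdf(b, df)
    for i in range(1, steps):
        total += (4 if i % 2 else 2) * t_pdf(i * h, df)
    tail = 0.5 - total * h / 3      # area to the right of |t_stat|
    return tail if t_stat >= 0 else 1 - tail

def paired_t(before, after):
    """t-statistic and degrees of freedom for paired measurements."""
    diffs = [b - a for b, a in zip(before, after)]  # positive = reduction
    n = len(diffs)
    return mean(diffs) / (stdev(diffs) / math.sqrt(n)), n - 1

# Hypothetical systolic readings (mmHg) for 10 patients.
before = [150, 142, 160, 155, 148, 152, 158, 145, 149, 151]
after = [144, 140, 151, 150, 147, 146, 150, 143, 148, 145]

t_stat, df = paired_t(before, after)
p_value = upper_tail_p(t_stat, df)  # one-tailed: H1 says the drug lowers BP
alpha = 0.05

print("t =", round(t_stat, 3), "df =", df, "p =", round(p_value, 5))
print("reject H0" if p_value < alpha else "fail to reject H0")

# The t = 2.5, df = 29 case from the example above.
print("p for t=2.5, df=29:", round(upper_tail_p(2.5, 29), 4))
```

Because the differences are computed as before minus after, a positive t-statistic corresponds to a reduction in blood pressure, which is why the upper tail is used for this one-sided alternative.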






Types of error

 ![4.png](attachment:4.png)

![image.png](attachment:image.png)