## Historical Data and Decision Making 

Central to statistics is the ability to acquire meaning from collected data. To make sense of chance and unforeseen occurrences, humans have grappled with understanding the likelihood of events for millennia. Data distributions allow scientists to visualize data and draw meaning from their observations. In contrast to simply visualizing data, statistics enable data scientists to form equations that can further their understanding. Computing attributes like variance or skewness of a distribution deepens the understanding of complex problems. 

Programs such as R or Octave make it quick to compute statistical measures. This analysis uses R to draw more meaning from the examples and to give context to the explanations. It explores several concepts using a dataset of historical stock price data. The dataset includes columns for skewness, median, mean, standard deviation, and last price. The R code below loads the dataset into a new project and names the columns. Note header = FALSE in the read.csv command: without it, R would treat the first row of data as the column names.


In [None]:
> historicaldata <- read.csv("C:\\Users\\lisaj\\Downloads\\dataG2.csv", header = FALSE)
> colnames(historicaldata) <- c("Skewness", "Median", "Mean", "Standard_Deviation", "Last_Price")
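
The effect of header = FALSE can be checked without the original file. The following is a minimal sketch that reads a small, made-up CSV string through the text argument of read.csv; the values and the names sample_csv, without_header, and with_header are illustrative assumptions, not rows or objects from dataG2.csv.

```r
# Hypothetical three-row CSV with no header line (values are made up).
sample_csv <- "0.12,0.50,0.52,0.10,0.48
-0.30,0.62,0.57,0.20,0.03
4.08,0.10,0.12,0.09,1.00"

column_names <- c("Skewness", "Median", "Mean", "Standard_Deviation", "Last_Price")

# header = FALSE keeps every line as data; names are assigned afterwards.
without_header <- read.csv(text = sample_csv, header = FALSE)
colnames(without_header) <- column_names

# header = TRUE (the default) consumes the first data row as column names.
with_header <- read.csv(text = sample_csv, header = TRUE)

print(nrow(without_header))  # every line of the CSV is kept as a data row
print(nrow(with_header))     # one row fewer: the first line became the header
str(without_header)          # check that each column was read as numeric
```

Running str() on the real data frame after loading is a quick way to confirm that all five columns were read as numbers before computing any statistics.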


### Understanding Variance and Skewness

Variance is a way to understand how spread-out data is (Nield, 2022). Variance answers how close or far apart values are in a distribution (Bruce & Bruce, 2019). Data scientists may be tasked with measuring variance to help make more informed decisions based on data distribution. Evaluating whether variability is real or random is a valuable tool when interpreting distributions, their meanings, and the validity of the information produced. Below is an example of variance and the R code that simplifies the computation of this statistical concept. 

![image.png](attachment:image.png)


In [None]:
> Variance_last_price <- var(historicaldata$Last_Price)
> cat("Variance of Last Price", Variance_last_price, "\n")
Variance of Last Price 0.1113323
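
As a cross-check on what var() computes, the sketch below works through the sample variance, that is, the sum of squared deviations from the mean divided by n - 1, on a small made-up vector. The vector x is an assumption for illustration and is not taken from the dataset.

```r
# Hypothetical values, chosen so the arithmetic is easy to follow.
x <- c(2, 4, 4, 4, 5, 5, 7, 9)

n <- length(x)
deviations <- x - mean(x)                        # distance of each value from the mean
manual_variance <- sum(deviations^2) / (n - 1)   # sample variance divides by n - 1

cat("Manual variance:", manual_variance, "\n")
cat("var() result:   ", var(x), "\n")

# The standard deviation is the square root of the variance.
cat("Standard deviation:", sqrt(manual_variance), "\n")
```

The two variance figures should agree, which also shows why the Standard_Deviation column and a variance carry the same information on different scales.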

A distribution is “skewed” to one side when one of its tails is longer than the other (Bruce & Bruce, 2019). Skewness measures this asymmetry. A positive skew means the distribution skews to the right: the right tail is longer, so most values sit toward the left and a few unusually high values stretch out to the right. A negative skew means the distribution skews to the left: the left tail is longer, so most values sit toward the right and a few unusually low values stretch out to the left. A perfectly symmetrical distribution has zero skewness. Below is the formula to calculate the skewness of a distribution, followed by R commands that simplify the computation.

![image.png](attachment:image.png)


In [None]:
> library(e1071)
> skewnesslastprice <- skewness(historicaldata$Last_Price)
> cat("Skewness of Last Price", skewnesslastprice, "\n")
Skewness of Last Price -0.2427279
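
To make the sign of skewness concrete, here is a minimal base-R sketch. The helper skew_b1 is a hypothetical name. It divides the third central moment by the cubed sample standard deviation, which should correspond to the default (type = 3) of skewness() in e1071, though that correspondence is stated here as an assumption. The three vectors are made up.

```r
# Hypothetical helper: third central moment over the cubed sample sd.
skew_b1 <- function(x) {
  n <- length(x)
  m3 <- sum((x - mean(x))^3) / n   # third central moment
  m3 / sd(x)^3                     # sd() uses the n - 1 denominator
}

right_tailed <- c(1, 2, 2, 3, 3, 3, 20)   # one large value stretches the right tail
left_tailed  <- -right_tailed             # mirror image: long left tail
symmetric    <- c(1, 2, 3, 4, 5)

cat("Right-tailed skewness:", skew_b1(right_tailed), "\n")  # positive
cat("Left-tailed skewness: ", skew_b1(left_tailed), "\n")   # negative
cat("Symmetric skewness:   ", skew_b1(symmetric), "\n")     # zero

# With a long right tail, the mean is pulled above the median.
cat("Mean:", mean(right_tailed), "Median:", median(right_tailed), "\n")
```

In the right-tailed vector most values are small and a single large value produces the positive skew, which is why the mean ends up above the median.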

### Interpreting Data in Context

The example for this analysis has labeled the columns of data according to their meaning. Without context for these individual columns, the data may be misinterpreted. For data analysis, context is key to drawing insights and driving effective decision-making. Let us investigate the meaning of the data in each column. The columns represent the skewness, median, mean, standard deviation, and last price of individual stocks.

In our data set’s “skewness” column, we see both negative and positive values. The more negative the value, the longer the left tail of that stock’s price distribution and the lower the potential returns; that stock may have experienced losses. The larger the positive value, the longer the right tail and the higher the potential returns; that stock may have experienced gains. This information may be valuable to investors, who may be interested in the past returns of a stock in terms they can understand.

The median column in the dataset represents the middle of that stock’s price distribution: fifty percent of the values fall below it, and fifty percent fall above it. This is key because stocks can be volatile, with sharp increases or decreases that distort other measures. Outliers have little effect on the median. Those interested in a particular stock may find this insight meaningful in context; they may accept some risk, especially if that stock was affected for a short period by outside circumstances. For example, COVID-19 affected stock prices through transportation issues and material shortages; the median could help investors see past such short-lived shocks.

The mean column represents the average price of that stock. Outliers can pull the mean up or down, but it still indicates the stock’s average price across the observed values. Investors may be interested in what price they can expect to see. If the current price of a stock is lower than its average, they may interpret it as a suitable time to purchase that stock. They may also be interested in comparing the mean to other columns in the data set.
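
The different sensitivity of the mean and the median to outliers can be seen in a short sketch. The prices below are made up for illustration and do not come from the dataset.

```r
# Hypothetical prices for one stock (made-up values).
prices <- c(10, 11, 12, 13, 14)

# The same prices plus one extreme observation.
prices_with_outlier <- c(prices, 100)

cat("Without outlier: mean =", mean(prices),
    "median =", median(prices), "\n")
cat("With outlier:    mean =", mean(prices_with_outlier),
    "median =", median(prices_with_outlier), "\n")

# How far each measure moved after adding the outlier.
mean_shift   <- mean(prices_with_outlier) - mean(prices)
median_shift <- median(prices_with_outlier) - median(prices)
cat("Shift in mean:", mean_shift, " Shift in median:", median_shift, "\n")
```

A single extreme value moves the mean by a large amount while the median barely changes, which is why comparing the two columns can hint at outliers in a stock's history.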

The standard deviation column illustrates the variability of stock prices relative to the mean of the stock. This illustrates how close values are to the average of the stock. For example, different number sets could have the same mean; the standard deviation can help differentiate how the values deviate from the mean. A low standard deviation tells you that values are close to the mean, and a high standard deviation tells you that values are spread further from the mean. In context, the standard deviation can be valuable for quickly interpreting whether a stock has price variability. Evaluating the standard deviation and the other column values can provide a well-rounded view of the stock’s performance. 
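
The point that different number sets can share a mean yet differ in spread can be sketched as follows. Both vectors are made-up illustrations, not stock data.

```r
# Two hypothetical sets of values with the same mean.
tight <- c(9, 10, 11)    # values close to the mean
wide  <- c(0, 10, 20)    # values far from the mean

cat("Means:", mean(tight), "and", mean(wide), "\n")

# The standard deviation separates the two sets.
cat("Standard deviation (tight):", sd(tight), "\n")
cat("Standard deviation (wide): ", sd(wide), "\n")
```

The means are identical, so only the standard deviation reveals that the second set varies far more around its average.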

The last price column is valuable because it provides context for those interpreting stock data. It gives an immediate look at the stock’s current standing and provides further context for the mean, standard deviation, skewness, and median. If the last price is close to the mean and median, the standard deviation is low, and the distribution skews to the right, the stock may carry a lower risk for an investor. A higher-risk stock could have a high standard deviation and still skew to the right; its mean could be higher than its median, its skewness could still be positive, and its last price could be higher than both the mean and the median. This column gives investors perspective on the stock’s current standing in relation to the other available data.

### Drawing Conclusions from Data

Drawing conclusions that explain the data in context is essential to statistical analysis. For instance, subtracting the values in the mean column from the values in the last price column can help us understand the overall performance of each stock. If a stock’s last price is above its mean, the difference is positive: the last price is greater than the stock’s average price. If the last price is below the mean, the difference is negative: the last price is less than the stock’s average price. To find the stocks with the smallest and largest differences, we can use the which.min() and which.max() functions in R to locate those rows and print them for further interpretation, as shown below.


In [None]:
> S <- historicaldata$Last_Price - historicaldata$Mean
> I_1 <- which.min(S)
> I_2 <- which.max(S)
> row_I_1 <- historicaldata[I_1, ]
> row_I_2 <- historicaldata[I_2, ]
> cat("I_1 (lower value) is row:", I_1, "\n")
I_1 (lower value) is row: 2050 
> print(row_I_1)
       Skewness   Median      Mean Standard_Deviation Last_Price
2050 -0.5489826 0.621074 0.5744321          0.2024042 0.02557281
> cat("I_2 (higher value) is at row:", I_2, "\n")
I_2 (higher value) is at row: 1817 
> print(row_I_2)
     Skewness    Median      Mean Standard_Deviation Last_Price
1817  4.08087 0.1030399 0.1179646         0.08939894          1

The output above shows that row 2050 has the last price furthest below its mean. Its last price is lower than both its mean and its median. The negative skewness is consistent with this: the distribution has a longer left tail, so most values sit toward the right with a few low values stretching out to the left. Its standard deviation is about 0.2, which indicates that values are spread relatively far from the mean. Row 1817, in contrast, is right-skewed: most of its values sit toward the left of the distribution, with a few high values stretching out to the right. Its last price is greater than both its median and its mean. Its standard deviation (about 0.09) also indicates that its values are closer to the mean than those of the stock in row 2050. Comparing these two stocks, investors might interpret the stock in row 1817 as less volatile than the one in row 2050 and prefer to purchase it. For this reason, the difference between the last price and the mean may be a better indication of performance than the skewness of a particular stock.
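
The same comparison can be extended from the two extreme rows to every row. The sketch below is one possible approach on a made-up data frame; the name toy, the Difference column, and all values are assumptions for illustration and do not come from the dataset.

```r
# Hypothetical stocks; the numbers are made up for illustration.
toy <- data.frame(
  Mean       = c(0.50, 0.20, 0.80, 0.10),
  Last_Price = c(0.55, 0.05, 0.80, 0.90)
)

# Positive: last price above the mean. Negative: last price below the mean.
toy$Difference <- toy$Last_Price - toy$Mean

lowest  <- which.min(toy$Difference)   # row furthest below its mean
highest <- which.max(toy$Difference)   # row furthest above its mean

# order() ranks every row, not only the two extremes.
ranked <- toy[order(toy$Difference), ]
print(ranked)

cat("Lowest difference is row:", lowest, "\n")
cat("Highest difference is row:", highest, "\n")
```

Storing the difference as a column keeps it next to the other measures, so each stock's standing relative to its own average can be read in one row.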

### Key Takeaway

It is important to note that a higher skewness does not always indicate that a stock is a better purchase. Fractional increases in a low-value stock illustrate that skewness does not determine profitability. Row 1144 has the largest positive skewness in the dataset. However, the differences between its last price and its mean and median are unremarkable. Using R, let us view the attributes of row 1144 alongside rows 2050 and 1817.


In [None]:
> print(historicaldata[1144,])
     Skewness      Median        Mean Standard_Deviation  Last_Price
1144 72.72923 0.000106703 0.000341332         0.01374237 0.000249594
> print(historicaldata[2050,])
       Skewness   Median      Mean Standard_Deviation Last_Price
2050 -0.5489826 0.621074 0.5744321          0.2024042 0.02557281
> print(historicaldata[1817, ])
     Skewness    Median      Mean Standard_Deviation Last_Price
1817  4.08087 0.1030399 0.1179646         0.08939894          1

When interpreted in context, it is easy to see that, even with a smaller skewness, the stock in row 1817 is a better purchase option than the one in row 1144. The standard deviation of row 1144 indicates that the values are closely clustered in absolute terms, and incremental changes in tiny numbers can inflate the skewness (Döpke & Pierdzioch, 2001), which further underscores the need to understand data in context. Outliers in the distribution could also contribute to the inflated skewness.
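
How a single outlier can inflate skewness among tightly clustered, tiny values can be sketched in base R. The helper skew_b1 is a hypothetical name (third central moment over the cubed sample standard deviation), and both vectors are made up. They are not a reconstruction of the prices behind row 1144.

```r
# Hypothetical helper: third central moment over the cubed sample sd.
skew_b1 <- function(x) {
  n <- length(x)
  m3 <- sum((x - mean(x))^3) / n
  m3 / sd(x)^3
}

# Tiny, tightly clustered values arranged symmetrically (made-up numbers).
clustered <- rep(c(0.0001, 0.0002, 0.0003), times = 33)

# The same values plus one comparatively large observation.
with_outlier <- c(clustered, 0.1)

cat("Skewness, clustered values:", skew_b1(clustered), "\n")
cat("Skewness, with one outlier:", skew_b1(with_outlier), "\n")
cat("Standard deviation with outlier:", sd(with_outlier), "\n")
```

The clustered values alone are symmetric, while one outlier produces a large positive skewness even though the standard deviation stays small in absolute terms. A very high skewness can therefore reflect one unusual observation and is not by itself a sign of strong performance.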

### Conclusion

Data scientists add valuable perspective to the interpretation of data. This analysis underscored the usefulness of variance, skewness, mean, median, and standard deviation in aiding those interpretations and data presentations. R is a powerful statistical program designed to assist in the computation and visualization of these concepts, giving perspective and context to the data under interpretation. Well-formed analysis can provide decision-makers with actionable business insights or support hypotheses and conclusions in studies or experiments. Statistical analysis is a tool that aids data scientists in the modern world; it can be not only insightful but also predictive.

### References

Bruce, P., & Bruce, A. (2019). Practical statistics for data scientists: 50+ essential concepts using R and Python (2nd ed.). O'Reilly Media.

Döpke, J., & Pierdzioch, C. (2001, July). Duesternbrooker Weg 120 24105 Kiel Germany (Kiel Working Paper No. 1059).

Meyer, D., Dimitriadou, E., Hornik, K., Weingessel, A., & Leisch, F. (2023). e1071: Misc functions of the Department of Statistics, Probability Theory Group (Formerly: E1071), TU Wien (Version 1.7-13) [R package]. Comprehensive R Archive Network (CRAN). https://CRAN.R-project.org/package=e1071

Nield, T. (2022). Essential Math for Data Science (1st ed.). O'Reilly Media.

R Core Team. (2023). R: A language and environment for statistical computing. R Foundation for Statistical Computing. https://www.R-project.org/

Thulin, M. (2025). Modern Statistics with R (2nd ed.). CRC Press.

Wickham, H., François, R., Henry, L., Müller, K., & Vaughan, D. (2023). dplyr: A grammar of data manipulation (Version 1.1.2) [R package]. CRAN. https://CRAN.R-project.org/package=dplyr

Wickham, H. (2023). stringr: Simple, consistent wrappers for common string operations (Version 1.5.0) [R package]. CRAN. https://CRAN.R-project.org/package=stringr


