# Chapter 2: Statistical Analysis in Soccer - Basic Data Analysis

This notebook demonstrates fundamental data analysis techniques in R, including:
- Basic data manipulation
- Descriptive statistics
- Data visualization
- Statistical testing
- Working with external data

## Example 2.1: Basic Data Manipulation

Working with basic vectors and calculating summary statistics for match data.

In [1]:
# Clear workspace
rm(list = ls())    

# Create match data
MatchID <- c(1:10)  
print(MatchID)  

GoalsFor <- c(0,2,4,1,3,0,2,2,3,1)   
print(GoalsFor)  

GoalsAgainst <- c(1,1,3,3,0,0,1,1,1,0)  
print(GoalsAgainst)  

GoalDiff <- GoalsFor-GoalsAgainst   
print(GoalDiff)   

# Calculate summary statistics
# Goals scored:
cat("Goals For Statistics:\n")
cat("Mean:", mean(GoalsFor), "\n")
cat("Median:", median(GoalsFor), "\n")
cat("SD:", sd(GoalsFor), "\n")
cat("Variance:", var(GoalsFor), "\n\n")

# Goals conceded:
cat("Goals Against Statistics:\n")
cat("Mean:", round(mean(GoalsAgainst),3), "\n")
cat("Median:", round(median(GoalsAgainst),3), "\n")
cat("SD:", round(sd(GoalsAgainst),3), "\n")
cat("Variance:", round(var(GoalsAgainst),3), "\n\n")

# Goal difference:
cat("Goal Difference Statistics:\n")
cat("Mean:", round(mean(GoalDiff),3), "\n")
cat("Median:", round(median(GoalDiff),3), "\n")
cat("SD:", round(sd(GoalDiff),3), "\n")
cat("Variance:", round(var(GoalDiff),3), "\n")

# Compile data frame
goals_dat <- cbind.data.frame(MatchID, GoalsFor, GoalsAgainst, GoalDiff)
print(goals_dat)

 [1]  1  2  3  4  5  6  7  8  9 10
 [1] 0 2 4 1 3 0 2 2 3 1
 [1] 1 1 3 3 0 0 1 1 1 0
 [1] -1  1  1 -2  3  0  1  1  2  1
Goals For Statistics:
Mean: 1.8 
Median: 2 
SD: 1.316561 
Variance: 1.733333 

Goals Against Statistics:
Mean: 1.1 
Median: 1 
SD: 1.101 
Variance: 1.211 

Goal Difference Statistics:
Mean: 0.7 
Median: 1 
SD: 1.418 
Variance: 2.011 
   MatchID GoalsFor GoalsAgainst GoalDiff
1        1        0            1       -1
2        2        2            1        1
3        3        4            3        1
4        4        1            3       -2
5        5        3            0        3
6        6        0            0        0
7        7        2            1        1
8        8        2            1        1
9        9        3            1        2
10      10        1            0        1
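
The summary statistics above can also be reproduced by hand, which shows that `sd()` and `var()` use the sample (n - 1) denominator. The following is a minimal sketch using the `GoalsFor` values from this example; the names `n`, `manual_var`, and `manual_sd` are introduced here for illustration only.

```r
# Goals scored in the ten matches (same values as in Example 2.1)
GoalsFor <- c(0, 2, 4, 1, 3, 0, 2, 2, 3, 1)

n <- length(GoalsFor)

# Sample variance: squared deviations from the mean, divided by n - 1
manual_var <- sum((GoalsFor - mean(GoalsFor))^2) / (n - 1)

# Standard deviation is the square root of the variance
manual_sd <- sqrt(manual_var)

# Both comparisons should report TRUE
print(all.equal(manual_var, var(GoalsFor)))
print(all.equal(manual_sd, sd(GoalsFor)))
```

Dividing by `n` instead of `n - 1` would give the population variance, which is slightly smaller for a sample of this size.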

## Example 2.2: Data Frame Operations

Exploring different ways to examine and manipulate data frames.

In [2]:
# Examine data frame structure
names(goals_dat)

head(goals_dat)

head(goals_dat, 8)  # First eight rows

tail(goals_dat)

# Data frame dimensions
nrow(goals_dat)  # Number of rows
ncol(goals_dat)  # Number of columns
dim(goals_dat)   # Both dimensions

str(goals_dat)   # Structure of the data

# Different methods to access data
# Method 1: Using $ notation
print("Goals For using $ notation:")
goals_dat$GoalsFor  

# Method 2: Using square brackets
print("\nGoals For using column index:")
goals_dat[,2]  

# Multiple columns
print("\nGoals For and Against:")
goals_dat[,c(2,3)]  

# Selecting specific rows
print("\nRows 3-5:")
goals_dat[c(3,4,5),]

  MatchID GoalsFor GoalsAgainst GoalDiff
    <int>    <dbl>        <dbl>    <dbl>
1       1        0            1       -1
2       2        2            1        1
3       3        4            3        1
4       4        1            3       -2
5       5        3            0        3
6       6        0            0        0
6,6,0,0,0


  MatchID GoalsFor GoalsAgainst GoalDiff
    <int>    <dbl>        <dbl>    <dbl>
1       1        0            1       -1
2       2        2            1        1
3       3        4            3        1
4       4        1            3       -2
5       5        3            0        3
6       6        0            0        0
7       7        2            1        1
8       8        2            1        1


   MatchID GoalsFor GoalsAgainst GoalDiff
     <int>    <dbl>        <dbl>    <dbl>
5        5        3            0        3
6        6        0            0        0
7        7        2            1        1
8        8        2            1        1
9        9        3            1        2
10      10        1            0        1


'data.frame':	10 obs. of  4 variables:
 $ MatchID     : int  1 2 3 4 5 6 7 8 9 10
 $ GoalsFor    : num  0 2 4 1 3 0 2 2 3 1
 $ GoalsAgainst: num  1 1 3 3 0 0 1 1 1 0
 $ GoalDiff    : num  -1 1 1 -2 3 0 1 1 2 1
[1] "Goals For using $ notation:"


[1] "\nGoals For using column index:"


[1] "\nGoals For and Against:"


GoalsFor GoalsAgainst
   <dbl>        <dbl>
       0            1
       2            1
       4            3
       1            3
       3            0
       0            0
       2            1
       2            1
       3            1
       1            0


[1] "\nRows 3-5:"


  MatchID GoalsFor GoalsAgainst GoalDiff
    <int>    <dbl>        <dbl>    <dbl>
3       3        4            3        1
4       4        1            3       -2
5       5        3            0        3
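
Positional indices such as `goals_dat[,2]` depend on the column order staying fixed. As a complement, here is a small sketch of selecting by column name and filtering rows with a logical condition. It rebuilds `goals_dat` so that it runs on its own; the object names `goals_only`, `goals_for`, and `high_scoring` are illustrative and not part of the original notebook.

```r
# Rebuild the data frame from Example 2.1
goals_dat <- data.frame(
  MatchID      = 1:10,
  GoalsFor     = c(0, 2, 4, 1, 3, 0, 2, 2, 3, 1),
  GoalsAgainst = c(1, 1, 3, 3, 0, 0, 1, 1, 1, 0)
)
goals_dat$GoalDiff <- goals_dat$GoalsFor - goals_dat$GoalsAgainst

# Select columns by name; this keeps working if the column order changes
goals_only <- goals_dat[, c("GoalsFor", "GoalsAgainst")]

# A single column as a vector, using [[ ]] (equivalent to goals_dat$GoalsFor)
goals_for <- goals_dat[["GoalsFor"]]

# Rows that satisfy a condition: matches with three or more goals scored
high_scoring <- goals_dat[goals_dat$GoalsFor >= 3, ]
print(high_scoring)
```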


## Example 2.3: Conditional Operations

Demonstrating different methods for creating match outcomes using conditional logic.

In [3]:
# Method 1: Using ifelse
outcome1 <- ifelse(goals_dat$GoalsFor > goals_dat$GoalsAgainst, "Win", "Did not win")
print("Outcome using ifelse:")
print(outcome1)

# Method 2: Using if statements and for loop
n <- nrow(goals_dat)
outcome2 <- character(n)  # Preallocated vector for results
for (i in seq_len(n)) {
  if (goals_dat$GoalsFor[i] > goals_dat$GoalsAgainst[i]) {
    outcome2[i] <- "Win"
  } else if (goals_dat$GoalsFor[i] < goals_dat$GoalsAgainst[i]) {
    outcome2[i] <- "Lose"
  } else {
    outcome2[i] <- "Draw"
  }
}

print("\nOutcome using for loop:")
print(outcome2)

# Add results to data frame
match_dat <- cbind.data.frame(goals_dat, outcome2)
names(match_dat)

# Rename outcome2 column
colnames(match_dat)[colnames(match_dat) == 'outcome2'] <- 'Result'
print("\nFinal data frame:")
print(match_dat)

# Filter for wins
winResults <- match_dat[match_dat$Result == "Win",]
print("\nWinning matches:")
print(winResults)

[1] "Outcome using ifelse:"
 [1] "Did not win" "Win"         "Win"         "Did not win" "Win"        
 [6] "Did not win" "Win"         "Win"         "Win"         "Win"        
[1] "\nOutcome using for loop:"
 [1] "Lose" "Win"  "Win"  "Lose" "Win"  "Draw" "Win"  "Win"  "Win"  "Win" 


[1] "\nFinal data frame:"
   MatchID GoalsFor GoalsAgainst GoalDiff Result
1        1        0            1       -1   Lose
2        2        2            1        1    Win
3        3        4            3        1    Win
4        4        1            3       -2   Lose
5        5        3            0        3    Win
6        6        0            0        0   Draw
7        7        2            1        1    Win
8        8        2            1        1    Win
9        9        3            1        2    Win
10      10        1            0        1    Win
[1] "\nWinning matches:"
   MatchID GoalsFor GoalsAgainst GoalDiff Result
2        2        2            1        1    Win
3        3        4            3        1    Win
5        5        3            0        3    Win
7        7        2            1        1    Win
8        8        2            1        1    Win
9        9        3            1        2    Win
10      10        1            0        1    Win
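
The three-way outcome can also be computed without a loop. The sketch below is one possible vectorised approach, not the notebook's own method: it uses `sign()` on the goal difference and a lookup vector. The names `goal_sign`, `outcome3`, and `result_counts` are introduced here for illustration.

```r
GoalsFor     <- c(0, 2, 4, 1, 3, 0, 2, 2, 3, 1)
GoalsAgainst <- c(1, 1, 3, 3, 0, 0, 1, 1, 1, 0)

# sign() returns -1, 0 or 1 for a loss, draw or win respectively
goal_sign <- sign(GoalsFor - GoalsAgainst)

# Map the sign to a label; adding 2 turns -1/0/1 into the indices 1/2/3
outcome3 <- c("Lose", "Draw", "Win")[goal_sign + 2]
print(outcome3)

# Count how often each result occurs
result_counts <- table(outcome3)
print(result_counts)
```

For these ten matches the counts come to seven wins, two losses, and one draw, matching the `Result` column above.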

## Example 2.4: Advanced Descriptive Statistics

Using the psych package for more comprehensive statistical analysis.

In [4]:
# Basic summary
print("Basic summary:")
summary(match_dat)

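# Load psych package (install once with install.packages("psych") if it is missing)
library(psych)

# Detailed descriptive statistics
print("\nDetailed statistics:")
# describe() suits ungrouped data; describeBy() expects a grouping variable
des_res <- describe(match_dat[, 2:4])
print(des_res)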

[1] "Basic summary:"


    MatchID         GoalsFor     GoalsAgainst     GoalDiff    
 Min.   : 1.00   Min.   :0.00   Min.   :0.00   Min.   :-2.00  
 1st Qu.: 3.25   1st Qu.:1.00   1st Qu.:0.25   1st Qu.: 0.25  
 Median : 5.50   Median :2.00   Median :1.00   Median : 1.00  
 Mean   : 5.50   Mean   :1.80   Mean   :1.10   Mean   : 0.70  
 3rd Qu.: 7.75   3rd Qu.:2.75   3rd Qu.:1.00   3rd Qu.: 1.00  
 Max.   :10.00   Max.   :4.00   Max.   :3.00   Max.   : 3.00  
    Result         
 Length:10         
 Class :character  
 Mode  :character  
                   
                   
                   

ERROR: Error in library(psych): there is no package called ‘psych’
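
When `psych` is not installed, as in the error above, a rough base-R substitute can produce a similar table. The sketch below is an assumption-laden stand-in, not the output of `psych`: the helper `describe_basic()` is hypothetical and reports only a few of the statistics that package would provide.

```r
# Numeric columns of match_dat, rebuilt so the example runs on its own
match_dat <- data.frame(
  GoalsFor     = c(0, 2, 4, 1, 3, 0, 2, 2, 3, 1),
  GoalsAgainst = c(1, 1, 3, 3, 0, 0, 1, 1, 1, 0)
)
match_dat$GoalDiff <- match_dat$GoalsFor - match_dat$GoalsAgainst

# Hypothetical helper: a handful of descriptive statistics for one numeric vector
describe_basic <- function(x) {
  c(n = length(x), mean = mean(x), sd = sd(x),
    median = median(x), min = min(x), max = max(x))
}

# Apply the helper to every column; t() puts the variables in rows
des_base <- t(sapply(match_dat, describe_basic))
print(round(des_base, 3))
```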


## Example 2.5: Working with External Data

Reading and analyzing the Arsenal-Chelsea comparison dataset.

In [None]:
# Clear workspace and read data
rm(list = ls())
dat <- read.csv("Arsenal_Chelsea_comparison.csv")
print("Data overview:")
print(dat)

# Produce descriptive statistics
library(psych)
# No grouping variable is used, so describe() is called rather than describeBy()
des_results <- describe(dat[, 2:7])
print("\nDescriptive statistics:")
print(des_results)
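
Before analysing an imported file, it can help to confirm that the expected columns are present. The sketch below writes a tiny stand-in CSV to a temporary file and reads it back. The values are placeholders, not real league results, and the `Season` column name is an assumption; only the `Arsenal_*` column names are taken from the code in this chapter.

```r
# Write a tiny stand-in CSV to a temporary file.
# The numbers are placeholders for illustration, NOT real league results.
csv_path <- tempfile(fileext = ".csv")
writeLines(c(
  "Season,Arsenal_GF,Arsenal_GA,Arsenal_points",
  "A,10,5,20",
  "B,12,7,22"
), csv_path)

# Read the file and confirm the columns the analysis relies on are present
demo_dat <- read.csv(csv_path)
needed <- c("Arsenal_GF", "Arsenal_GA", "Arsenal_points")
missing_cols <- setdiff(needed, names(demo_dat))
if (length(missing_cols) > 0) {
  stop("Missing columns: ", paste(missing_cols, collapse = ", "))
}

# Inspect the column types that read.csv() inferred
str(demo_dat)
```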

## Example 2.6: Time Series Visualization

Creating a time series plot comparing Arsenal and Chelsea's performance.

In [None]:
# Create time series plot
seasons <- 2011:2020  # numeric x-axis values, on the same scale as the legend position

plot(seasons, dat$Arsenal_GF, type="o", lty=1, pch=20, col="black", ylim=c(0,140), 
     ylab="Goals", xlab="Season")
lines(seasons, dat$Chelsea_GF, type="o", lty=2, pch=20)
lines(seasons, dat$Arsenal_GA, type="o", lty=1, pch=4)
lines(seasons, dat$Chelsea_GA, type="o", lty=2, pch=4)
legend(2011,145, c("Arsenal goals for","Arsenal goals against","Chelsea goals for",
      "Chelsea goals against"), cex=0.8, col=c("black","black","black","black"),
       lty=c(1,1,2,2), pch=c(20,4,20,4), bty = "n")
title("Arsenal and Chelsea comparison")

## Example 2.7: Box Plot Analysis

Comparing goal distributions using box plots.

In [None]:
# Create box plot
boxplot(dat[,c(2,3,5,6)], ylab="Goals")

# Show summary statistics
summary(dat[,c(2,3,5,6)])
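
The box in a box plot is drawn from hinges, which can differ slightly from the quartiles reported by `summary()`. Since the Arsenal-Chelsea file is not reproduced here, this sketch illustrates the point with the `GoalsFor` values from Example 2.1; the names `five` and `bp` are illustrative.

```r
GoalsFor <- c(0, 2, 4, 1, 3, 0, 2, 2, 3, 1)

# Five-number summary: min, lower hinge, median, upper hinge, max
five <- fivenum(GoalsFor)
print(five)

# boxplot.stats() returns the values boxplot() would draw, without plotting
bp <- boxplot.stats(GoalsFor)
print(bp$stats)  # whisker ends, hinges and median
print(bp$out)    # points flagged as outliers, if any

# quantile() (used by summary()) interpolates, so quartiles can differ from hinges
print(quantile(GoalsFor, c(0.25, 0.75)))
```

For these values the upper hinge is 3, whereas the interpolated third quartile is 2.75, as shown for `GoalsFor` in the `summary()` output of Example 2.4.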

## Example 2.8: Scatter Plot with Regression Lines

Analyzing the relationship between goals conceded and points.

In [None]:
# Create scatter plot with regression lines
plot(dat$Chelsea_GA, dat$Chelsea_points, pch=20, col="black", xlim=c(0,60), 
     ylim=c(0,100), ylab="Points", xlab="Goals conceded")
points(dat$Arsenal_GA, dat$Arsenal_points, pch=4)  # Arsenal: goals conceded vs. Arsenal's points
abline(lm(dat$Chelsea_points ~ dat$Chelsea_GA), lty=1)
abline(lm(dat$Arsenal_points ~ dat$Arsenal_GA), lty=2)
# Legend line types match the abline() calls: Arsenal dashed (2), Chelsea solid (1)
legend(5,40, c("Arsenal goals conceded","Arsenal best-fit line","Chelsea goals conceded","Chelsea best-fit line"),
       cex=0.8, col=c("black","black","black","black"), 
       lty=c(0,2,0,1), pch=c(4,NA,20,NA), bty = "n")
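
The line drawn by `abline(lm(...))` is defined by an intercept and a slope, which can be read off with `coef()`. The sketch below uses made-up placeholder values that lie exactly on a line, so the coefficients are easy to check; it is not the Arsenal or Chelsea data, and the names `toy`, `fit`, and `fit_coef` are illustrative.

```r
# Made-up placeholder values, chosen to lie exactly on a line (NOT real data)
toy <- data.frame(
  goals_conceded = c(30, 35, 40, 45, 50),
  points         = c(80, 75, 70, 65, 60)
)

# Fit points as a linear function of goals conceded
fit <- lm(points ~ goals_conceded, data = toy)

# Intercept and slope of the best-fit line that abline(fit) would draw
fit_coef <- coef(fit)
print(fit_coef)
```

Here the slope is -1, meaning each additional goal conceded corresponds to one point fewer in this toy data. Passing `data = toy` to `lm()` also gives shorter coefficient names than writing `toy$points ~ toy$goals_conceded`.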

## Example 2.9: Statistical Tests

Performing paired t-tests and correlation analysis.

In [None]:
# Paired t-tests
print("Paired t-test for Goals For:")
t.test(dat$Arsenal_GF, dat$Chelsea_GF, paired=TRUE)

print("\nPaired t-test for Goals Against:")
t.test(dat$Arsenal_GA, dat$Chelsea_GA, paired=TRUE)

# Correlation tests
print("\nArsenal correlation test (Goals Against vs Points):")
cor.test(dat$Arsenal_GA,dat$Arsenal_points)

print("\nChelsea correlation test (Goals Against vs Points):")
cor.test(dat$Chelsea_GA,dat$Chelsea_points)
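
A paired t-test is equivalent to a one-sample t-test on the pairwise differences. The sketch below checks this with the `GoalsFor` and `GoalsAgainst` values from Example 2.1, which are paired by match. It only illustrates the mechanics and says nothing about the Arsenal-Chelsea data; the names `paired_res` and `diff_res` are illustrative.

```r
GoalsFor     <- c(0, 2, 4, 1, 3, 0, 2, 2, 3, 1)
GoalsAgainst <- c(1, 1, 3, 3, 0, 0, 1, 1, 1, 0)

# Paired test on the two vectors
paired_res <- t.test(GoalsFor, GoalsAgainst, paired = TRUE)

# One-sample test on the match-by-match differences
diff_res <- t.test(GoalsFor - GoalsAgainst, mu = 0)

# The two approaches give the same t statistic and p-value
print(unname(paired_res$statistic))
print(unname(diff_res$statistic))
print(c(paired = paired_res$p.value, one_sample = diff_res$p.value))
```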

## Summary

Key concepts covered in this chapter:

1. Data Manipulation:
   - Vector operations
   - Data frame creation and manipulation
   - Conditional operations

2. Statistical Analysis:
   - Basic summary statistics
   - Advanced descriptive statistics (psych package)
   - Paired t-tests
   - Correlation analysis

3. Data Visualization:
   - Time series plots
   - Box plots
   - Scatter plots with regression lines

4. External Data:
   - Reading CSV files
   - Importing data into a data frame
   - Working with real soccer data