# Descriptive Statistics

This section presents descriptive statistics for the Student Placement Dataset.
The goal is to summarize central tendency, variability, and distributional
characteristics of the most important attributes.


In [1]:
library(tidyverse)

df <- read.csv("train.csv")

str(df)
summary(df)


── Attaching core tidyverse packages ──────────────────────── tidyverse 2.0.0 ──
✔ dplyr     1.1.4     ✔ readr     2.1.6
✔ forcats   1.0.1     ✔ stringr   1.6.0
✔ ggplot2   4.0.1     ✔ tibble    3.3.0
✔ lubridate 1.9.4     ✔ tidyr     1.3.2
✔ purrr     1.2.0
── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
✖ dplyr::filter() masks stats::filter()
✖ dplyr::lag()    masks stats::lag()
ℹ Use the conflicted package (<http://conflicted.r-lib.org/>) to force all conflicts to become errors


'data.frame':	45000 obs. of  15 variables:
 $ Student_ID          : int  1048 37820 49668 19467 23094 8710 24363 27448 23663 39336 ...
 $ Age                 : int  22 20 22 22 20 20 21 19 24 23 ...
 $ Gender              : chr  "Female" "Female" "Male" "Male" ...
 $ Degree              : chr  "B.Tech" "BCA" "MCA" "MCA" ...
 $ Branch              : chr  "ECE" "ECE" "ME" "ME" ...
 $ CGPA                : num  6.29 6.05 7.22 7.78 7.63 7.99 7.5 8 6.24 7.08 ...
 $ Internships         : int  0 1 1 2 1 1 1 0 0 2 ...
 $ Projects            : int  3 4 4 4 4 4 4 4 3 4 ...
 $ Coding_Skills       : int  4 6 6 6 6 5 6 4 3 7 ...
 $ Communication_Skills: int  6 8 6 6 5 7 4 5 10 6 ...
 $ Aptitude_Test_Score : int  51 59 58 90 79 84 71 74 54 59 ...
 $ Soft_Skills_Rating  : int  5 8 6 4 6 6 8 3 7 6 ...
 $ Certifications      : int  1 2 2 2 2 2 2 1 1 2 ...
 $ Backlogs            : int  3 1 2 0 0 0 1 0 2 2 ...
 $ Placement_Status    : chr  "Not Placed" "Not Placed" "Not Placed" "Placed" ...


   Student_ID         Age        Gender             Degree         
 Min.   :    1   Min.   :18   Length:45000       Length:45000      
 1st Qu.:12510   1st Qu.:19   Class :character   Class :character  
 Median :24958   Median :21   Mode  :character   Mode  :character  
 Mean   :24978   Mean   :21                                        
 3rd Qu.:37475   3rd Qu.:23                                        
 Max.   :50000   Max.   :24                                        
    Branch               CGPA        Internships        Projects    
 Length:45000       Min.   :4.500   Min.   :0.0000   Min.   :1.000  
 Class :character   1st Qu.:6.320   1st Qu.:0.0000   1st Qu.:3.000  
 Mode  :character   Median :7.000   Median :1.0000   Median :4.000  
                    Mean   :7.002   Mean   :0.7741   Mean   :3.734  
                    3rd Qu.:7.670   3rd Qu.:1.0000   3rd Qu.:4.000  
                    Max.   :9.800   Max.   :3.0000   Max.   :6.000  
 Coding_Skills    Communication_Skills Ap

## Numeric Summary Statistics

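The following cells summarize each numeric attribute by its mean, median,
minimum, maximum, standard deviation, and variance, and then by its range
(maximum minus minimum). Student_ID is an integer identifier, so its
statistics appear in the output but carry no analytical meaning.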


In [2]:
numeric_summary <- df %>%
  select(where(is.numeric)) %>%
  summarise(across(everything(), list(
    mean = ~mean(.x, na.rm = TRUE),
    median = ~median(.x, na.rm = TRUE),
    min = ~min(.x, na.rm = TRUE),
    max = ~max(.x, na.rm = TRUE),
    sd = ~sd(.x, na.rm = TRUE),
    var = ~var(.x, na.rm = TRUE)
  )))

numeric_summary


Student_ID_mean,Student_ID_median,Student_ID_min,Student_ID_max,Student_ID_sd,Student_ID_var,Age_mean,Age_median,Age_min,Age_max,⋯,Certifications_min,Certifications_max,Certifications_sd,Certifications_var,Backlogs_mean,Backlogs_median,Backlogs_min,Backlogs_max,Backlogs_sd,Backlogs_var
<dbl>,<dbl>,<int>,<int>,<dbl>,<dbl>,<dbl>,<dbl>,<int>,<int>,⋯,<int>,<int>,<dbl>,<dbl>,<dbl>,<dbl>,<int>,<int>,<dbl>,<dbl>
24977.96,24957.5,1,50000,14425.61,208098100,20.99933,21,18,24,⋯,0,3,0.6501039,0.4226351,0.8881333,1,0,3,0.9709538,0.9427512
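
The single-row output above is wide (one column per variable and statistic), which makes it hard to scan. As a rough sketch, the same kind of summary can be reshaped into one row per variable. The `toy` data frame, its values, and the `__` separator are assumptions chosen for illustration; they are not part of the original notebook.

```r
library(dplyr)
library(tidyr)

# Toy stand-in for df: column names mirror the dataset, values are made up.
toy <- data.frame(
  Age  = c(18L, 21L, 24L),
  CGPA = c(6.0, 7.0, 8.0)
)

long_summary <- toy %>%
  # Name each result "<column>__<statistic>" so it can be split unambiguously.
  summarise(across(everything(), list(
    mean   = ~mean(.x, na.rm = TRUE),
    median = ~median(.x, na.rm = TRUE),
    sd     = ~sd(.x, na.rm = TRUE)
  ), .names = "{.col}__{.fn}")) %>%
  # Split the names: the first part becomes a row label, the second a column.
  pivot_longer(
    everything(),
    names_to  = c("variable", ".value"),
    names_sep = "__"
  )

long_summary
```

A double underscore is used as the separator because several column names in the dataset (for example `Aptitude_Test_Score`) already contain single underscores, so splitting on `_` would be ambiguous.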


In [3]:
range_summary <- df %>%
  select(where(is.numeric)) %>%
  summarise(across(everything(), ~max(.x, na.rm = TRUE) - min(.x, na.rm = TRUE)))

range_summary


Student_ID,Age,CGPA,Internships,Projects,Coding_Skills,Communication_Skills,Aptitude_Test_Score,Soft_Skills_Rating,Certifications,Backlogs
<int>,<int>,<dbl>,<int>,<int>,<int>,<int>,<int>,<int>,<int>,<int>
49999,6,5.3,3,5,9,9,65,9,3,3


## Categorical Variable Summary

Modes and frequency counts are used to summarize categorical attributes.


In [4]:
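# Return the most frequent value of x; ties go to the value that appears first.
get_mode <- function(x) {
  ux <- unique(x)                          # distinct values, in order of first appearance
  ux[which.max(tabulate(match(x, ux)))]    # count each value and pick the largest count
}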

categorical_modes <- df %>%
  select(where(is.character)) %>%
  summarise(across(everything(), get_mode))

categorical_modes


Gender,Degree,Branch,Placement_Status
<chr>,<chr>,<chr>,<chr>
Female,B.Sc,ME,Not Placed
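
One property of `get_mode` worth keeping in mind is that it returns a single value even when several categories are tied for the highest count. The sketch below illustrates this on a toy vector. `get_all_modes` is a hypothetical helper introduced here for comparison, and the toy values are invented.

```r
# Same definition as in the cell above.
get_mode <- function(x) {
  ux <- unique(x)
  ux[which.max(tabulate(match(x, ux)))]
}

# Hypothetical helper (not in the original notebook): return every tied mode.
get_all_modes <- function(x) {
  counts <- table(x)                    # frequency of each distinct value (NA dropped)
  names(counts)[counts == max(counts)]  # all values that reach the top count
}

# Toy vector in which "ME" and "ECE" both occur twice.
toy <- c("ME", "ECE", "ECE", "ME", "CSE")

get_mode(toy)       # only the tied value that appears first in the data
get_all_modes(toy)  # both tied values, in the sorted order used by table()
```

If ties matter for the analysis, a full frequency table per column (for example with `table()`) is a safer summary than a single mode.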


In [5]:
table(df$Placement_Status)

prop.table(table(df$Placement_Status)) * 100



Not Placed     Placed 
     28688      16312 


Not Placed     Placed 
  63.75111   36.24889 

The placement outcome is imbalanced: 28,688 students (63.75%) were not placed
and 16,312 (36.25%) were placed. This matters when interpreting results and
building predictive models, because accuracy alone can look good for a model
that favors the majority class.
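
To make the effect of the imbalance concrete, the following sketch computes the accuracy of a trivial classifier that always predicts the most frequent class. It uses only the counts printed by `table(df$Placement_Status)`. The variable names are illustrative.

```r
# Counts copied from the table(df$Placement_Status) output.
counts <- c("Not Placed" = 28688, "Placed" = 16312)

# Share of each class.
proportions <- counts / sum(counts)

# Accuracy of a classifier that always predicts the most frequent class.
baseline_accuracy <- max(proportions)

# Size of the majority class relative to the minority class.
imbalance_ratio <- max(counts) / min(counts)

round(100 * proportions, 2)
round(100 * baseline_accuracy, 2)
round(imbalance_ratio, 2)
```

Any model trained on this data should be compared against that baseline rather than against 50%, and metrics such as precision and recall for the "Placed" class are likely to be more informative than raw accuracy.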


In [6]:
missing_summary <- sapply(df, function(x) sum(is.na(x)))
missing_summary


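No missing (`NA`) values were observed in any column, so no imputation is
needed. Note that `is.na()` does not flag empty strings, and `read.csv()`
reads blank fields in character columns as `""` rather than `NA`, so this
check alone does not rule out blank entries.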
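
As a complementary check, the sketch below counts blank or whitespace-only strings alongside `NA` values. It runs on a small invented data frame, not on the real dataset, so the counts it prints say nothing about `train.csv`. The names `toy`, `na_counts`, and `blank_counts` are illustrative.

```r
# Toy data frame: values are invented to show the difference between NA and "".
toy <- data.frame(
  Gender = c("Female", "", "Male", NA),
  Branch = c("ECE", "ME", "   ", "CSE"),
  CGPA   = c(6.29, NA, 7.22, 7.78),
  stringsAsFactors = FALSE
)

# NA counts per column, computed the same way as missing_summary.
na_counts <- sapply(toy, function(x) sum(is.na(x)))

# Blank or whitespace-only strings per column; non-character columns count as 0.
blank_counts <- sapply(toy, function(x) {
  if (is.character(x)) sum(!is.na(x) & trimws(x) == "") else 0L
})

data.frame(na = na_counts, blank = blank_counts)
```

Applying the same two functions to `df` would show whether the character columns (Gender, Degree, Branch, Placement_Status) contain blanks that the `NA` check misses.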


In [7]:
mean(df$CGPA, na.rm = TRUE)
median(df$CGPA, na.rm = TRUE)
