### NumPy Exercises: Student Performance Dataset

#### 📋 Dataset Information

**Dataset**: Students Performance in Exams  
**Source**: https://www.kaggle.com/datasets/spscientist/students-performance-in-exams  
**Size**: 1,000 students  

**Columns**:
- `gender`: Student's gender (male/female)
- `race/ethnicity`: Student's ethnic group (group A, B, C, D, E)
- `parental level of education`: Parent's education level
- `lunch`: Type of lunch (standard/free or reduced)
- `test preparation course`: Completed test prep course (completed/none)
- `math score`: Score in math (0-100)
- `reading score`: Score in reading (0-100)
- `writing score`: Score in writing (0-100)

In [4]:
# No code needed here: this cell only loads the data
import numpy as np

# Load the dataset
data = np.genfromtxt('Data/StudentsPerformance.csv', delimiter=',', 
                     dtype=None, encoding='utf-8', names=True)

# For easier manipulation, separate numeric scores into a 2D array
math_scores = data['math_score']
reading_scores = data['reading_score']
writing_scores = data['writing_score']

# Create a 2D array with all scores
scores = np.column_stack([math_scores, reading_scores, writing_scores])

print("Dataset loaded successfully!")
print(f"Number of students: {len(data)}")
print(f"Scores array shape: {scores.shape}")

Dataset loaded successfully!
Number of students: 1000
Scores array shape: (1000, 3)


### Task 1: Explore the Dataset Structure
**Instructions**:
1. Print the shape of the `scores` array
2. Print the data type of the `scores` array
3. Calculate the total number of elements in the `scores` array
4. Print the first 5 students' scores
5. Print the last 3 students' scores

In [7]:
# TODO: Your code here

# Cast the scores to float (the CSV fields are quoted, so they load as strings).
math_scores = np.char.strip(math_scores.astype(str), '"').astype(float)
reading_scores = np.char.strip(reading_scores.astype(str), '"').astype(float)
writing_scores = np.char.strip(writing_scores.astype(str), '"').astype(float)

scores = np.column_stack([math_scores, reading_scores, writing_scores])

# 1. Shape
print("Shape:", scores.shape)

# 2. Data type
print("Data type:", scores.dtype)

# 3. Total elements
print("Total elements:", scores.size)

# 4. First 5 students
print("First 5 students:\n", scores[:5])

# 5. Last 3 students
print("Last 3 students:\n", scores[-3:])

Shape: (1000, 3)
Data type: float64
Total elements: 3000
First 5 students:
 [[72. 72. 74.]
 [69. 90. 88.]
 [90. 95. 93.]
 [47. 57. 44.]
 [76. 78. 75.]]
Last 3 students:
 [[59. 71. 65.]
 [68. 78. 77.]
 [77. 86. 86.]]
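
The cast at the top of the cell above appears to be needed because the CSV stores every field in double quotes (the raw record printed in Task 4 shows values such as `'"51"'`), so the score columns load as strings. A minimal sketch of the same conversion, using a small hypothetical array `raw` rather than the real data:

```python
import numpy as np

# Hypothetical sample of a quoted score column after loading.
raw = np.array(['"72"', '"69"', '"90"'])

# Strip the surrounding double quotes, then convert the strings to float.
clean = np.char.strip(raw, '"').astype(float)

print(clean)        # [72. 69. 90.]
print(clean.dtype)  # float64
```

Without the strip step, `astype(float)` would raise a `ValueError`, because `'"72"'` is not a valid number literal.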


### Task 2: Access Specific Student Data
**Instructions**:
1. Get the scores of the 10th student (remember: Python uses 0-indexing)
2. Get the math score of the 50th student
3. Get the writing score of the 100th student
4. Get all three scores for student at index 25
5. Create a new array containing scores of students at indices 0, 5, and 10

In [8]:
# TODO: Your code here
# 1. 10th student (all scores)
student_10 = scores[9,:]
print("Student 10 scores:", student_10)

# 2. Math score of 50th student
math_50 = math_scores[49]
print("Math score of student 50:", math_50)

# 3. Writing score of 100th student
writing_100 = writing_scores[99]
print("Writing score of student 100:", writing_100)

# 4. All scores for the 25th student (index 24; scores[25] would select index 25)
student_25 = scores[24, :]
print("Student 25 scores:", student_25)

# 5. Scores for students at indices 0, 5, and 10 (fancy indexing already returns a new array)
selected_students = scores[[0, 5, 10], :]
print("Selected students:\n", selected_students)

Student 10 scores: [38. 60. 50.]
Math score of student 50: 82.0
Writing score of student 100: 62.0
Student 25 scores: [74. 71. 80.]
Selected students:
 [[72. 72. 74.]
 [71. 83. 78.]
 [58. 54. 52.]]
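
Two details of this task are easy to mix up: the n-th student sits at index n-1, and indexing with a list of indices produces a copy while a plain slice produces a view. The sketch below illustrates both on a hypothetical `toy` array, not on the dataset:

```python
import numpy as np

# Hypothetical data: 4 "students" x 3 "subjects".
toy = np.arange(12).reshape(4, 3)

third_student = toy[2]   # the 3rd student lives at index 2
picked = toy[[0, 2]]     # list (fancy) indexing returns a copy
head = toy[:2]           # slicing returns a view on the same memory

picked[0, 0] = -1        # changing the copy ...
print(toy[0, 0])         # ... leaves the original value in place
print(np.shares_memory(head, toy), np.shares_memory(picked, toy))
```

Because fancy indexing already copies, wrapping the result in `np.array(...)` is not required to get an independent array.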


### Task 3: Filter Students by Performance
**Instructions**:
1. Find all students who scored above 80 in math
2. Find all students who scored below 50 in reading
3. Find all students who scored exactly 75 in writing
4. Count how many students scored above 90 in math
5. Find students who scored between 60 and 70 (inclusive) in reading

In [9]:
# TODO: Your code here
# 1. Students with math > 80
high_math_score = data[math_scores > 80]
    
# 2. Students with reading < 50
low_reading_score = data[reading_scores < 50]

# 3. Students with writing exactly 75
_75_writing_score = data[writing_scores == 75]

# 4. Count students with math > 90
_numberOf_90_math_score = len(data[math_scores > 90])
print(_numberOf_90_math_score)

# 5. Students with reading between 60 and 70 (inclusive on both ends)
_6070_reading_score = data[(reading_scores >= 60) & (reading_scores <= 70)]

50
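
A minimal sketch of an inclusive range filter and of counting matches directly from the boolean mask. The `reading` values here are hypothetical and chosen so that both endpoints are present:

```python
import numpy as np

# Hypothetical reading scores, including both boundary values.
reading = np.array([59.0, 60.0, 65.0, 70.0, 71.0])

# >= and <= keep the endpoints; the parentheses are required around each comparison.
in_range = (reading >= 60) & (reading <= 70)

selected = reading[in_range]        # the matching scores
count = np.count_nonzero(in_range)  # number of matches, no filtered array needed

print(selected)  # [60. 65. 70.]
print(count)     # 3
```

Counting on the mask avoids building the filtered array when only the number of students is needed.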


### Task 4: Compare Two Subjects
**Instructions**:
1. Find students who scored higher in math than in reading
2. Find students who scored the same in reading and writing
3. Count how many students scored higher in writing than in math
4. Find the student(s) with the largest difference between math and writing scores
5. Calculate the average difference between reading and writing scores

In [12]:
# TODO: Your code here
# 1. Students with math > reading
math_isGreaterthan_reading = data[math_scores > reading_scores]

# 2. Students with reading == writing
reading_isEqualto_writing = data[reading_scores == writing_scores]

# 3. Count students with writing > math
numberOf_writing_isGreaterthan_math = len(data[writing_scores > math_scores])

# 4. Largest difference between math and writing
largestDiff_student = []
largestDiff_score = -999
for i, student in enumerate(data):
    diff = abs(math_scores[i] - writing_scores[i])
    if diff > largestDiff_score:
        largestDiff_score = diff
        largestDiff_student = [student]
    elif diff == largestDiff_score:
        largestDiff_student.append(student)
print('Students who have the largest difference between math and writing:')
for sd in largestDiff_student:
    print(sd)
    
# 5. Average absolute difference between reading and writing
diff_reading_writing = np.abs(reading_scores - writing_scores)
avg_diff = np.mean(diff_reading_writing)
print()
print(avg_diff)

Students who have the largest difference between math and writing
('"female"', '"group C"', '"bachelor\'s degree"', '"free/reduced"', '"completed"', '"51"', '"72"', '"79"')

3.739
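
The search for the largest math/writing gap can also be written with array operations instead of an explicit loop. This is one possible sketch on hypothetical toy arrays (`math_toy`, `writing_toy`), deliberately including a tie:

```python
import numpy as np

# Hypothetical scores; students 0 and 2 tie for the largest gap.
math_toy = np.array([51.0, 72.0, 90.0, 40.0])
writing_toy = np.array([79.0, 74.0, 62.0, 43.0])

diff = np.abs(math_toy - writing_toy)      # per-student absolute gap
largest = diff.max()                       # size of the largest gap
winners = np.flatnonzero(diff == largest)  # indices of every student at that gap

print(largest)           # 28.0
print(winners.tolist())  # [0, 2]
print(diff.mean())       # 15.25
```

`np.flatnonzero` returns all tied indices, which matches the "student(s)" wording of the task.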


### Task 5: Score Distribution Analysis 
Analyze how scores are distributed across different ranges without using histograms or binning functions.

**Instructions**:

1. **Create score bins**: Define 10 ranges for scores:
   - Bin 0: 0-9 points
   - Bin 1: 10-19 points
   - Bin 2: 20-29 points
   - ... and so on ...
   - Bin 9: 90-100 points

2. **Count students in each bin for math scores**: Use a loop to go through all math scores and count how many fall into each bin.

3. **Display the distribution**: Print the count for each bin in a readable format.

4. **Find the most common range**: Determine which bin has the most students (the mode range).

5. **Calculate cumulative distribution**: For each bin, calculate how many students scored at or below that range.

6. **Repeat for reading and writing**: Apply the same analysis to the other two subjects.

7. **Compare distributions**: Identify which subject has the most students in the highest bin (90-100).

In [13]:
#TODO: your code here

# 1. Create an array to store counts for 10 bins
print('-----MATH------')
math_bins = [0] * 10

# 2. Count students in each bin
for s in math_scores:
    index = min(int(s//10),9)
    math_bins[index] += 1
    
# 3. Display the distribution
for i, count in enumerate(math_bins):
    low, high = i*10, i*10+9
    if i ==9:
        high = 100
    print(f'Bin {i} ({low} - {high}): {count}')
    
# 4. Find the most common range
max_bin = max(range(10), key=lambda i:math_bins[i])
print('------------')
print(f'Most common range: Bin {max_bin} with {math_bins[max_bin]} students')

# 5. Calculate cumulative distribution
math_cum = np.cumsum(math_bins)
print('------------')
print(f'Math cumulative: {math_cum.tolist()}')

# 6. Repeat for reading and writing
print()
print('-----READING------')
reading_bins = [0] * 10
for s in reading_scores:
    index = min(int(s//10),9)
    reading_bins[index] += 1
for i, count in enumerate(reading_bins):
    low, high = i*10, i*10+9
    if i ==9:
        high = 100
    print(f'Bin {i} ({low} - {high}): {count}')
max_bin = max(range(10), key=lambda i:reading_bins[i])
print('------------')
print(f'Most common range: Bin {max_bin} with {reading_bins[max_bin]} students')
reading_cum = np.cumsum(reading_bins)
print('------------')
print(f'Reading cumulative: {reading_cum.tolist()}')

print()
print('-----WRITING------')
writing_bins = [0] * 10
for s in writing_scores:
    index = min(int(s//10),9)
    writing_bins[index] += 1
for i, count in enumerate(writing_bins):
    low, high = i*10, i*10+9
    if i ==9:
        high = 100
    print(f'Bin {i} ({low} - {high}): {count}')
max_bin = max(range(10), key=lambda i:writing_bins[i])
print('------------')
print(f'Most common range: Bin {max_bin} with {writing_bins[max_bin]} students')
writing_cum = np.cumsum(writing_bins)
print('------------')
print(f'Writing cumulative: {writing_cum.tolist()}')

# 7. Compare distributions 
best_subject = max([('Math', math_bins[9]), ('Reading', reading_bins[9]), ('Writing', writing_bins[9])],
                   key=lambda x: x[1])
print()
print(f'MOST STUDENTS IN 90-100: {best_subject[0]} ({best_subject[1]} students)')

-----MATH------
Bin 0 (0 - 9): 2
Bin 1 (10 - 19): 2
Bin 2 (20 - 29): 10
Bin 3 (30 - 39): 26
Bin 4 (40 - 49): 95
Bin 5 (50 - 59): 188
Bin 6 (60 - 69): 268
Bin 7 (70 - 79): 216
Bin 8 (80 - 89): 135
Bin 9 (90 - 100): 58
------------
Most common range: Bin 6 with 268 students
------------
Math cumulative: [2, 4, 14, 40, 135, 323, 591, 807, 942, 1000]

-----READING------
Bin 0 (0 - 9): 0
Bin 1 (10 - 19): 1
Bin 2 (20 - 29): 7
Bin 3 (30 - 39): 18
Bin 4 (40 - 49): 64
Bin 5 (50 - 59): 164
Bin 6 (60 - 69): 233
Bin 7 (70 - 79): 264
Bin 8 (80 - 89): 170
Bin 9 (90 - 100): 79
------------
Most common range: Bin 7 with 264 students
------------
Reading cumulative: [0, 1, 8, 26, 90, 254, 487, 751, 921, 1000]

-----WRITING------
Bin 0 (0 - 9): 0
Bin 1 (10 - 19): 3
Bin 2 (20 - 29): 6
Bin 3 (30 - 39): 23
Bin 4 (40 - 49): 82
Bin 5 (50 - 59): 167
Bin 6 (60 - 69): 230
Bin 7 (70 - 79): 254
Bin 8 (80 - 89): 157
Bin 9 (90 - 100): 78
------------
Most common range: Bin 7 with 254 students
------------
Writing c
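
The three per-subject blocks in the cell above repeat the same logic, which makes copy-paste slips easy. One way to avoid the repetition is a pair of small helpers. The names `bin_counts` and `cumulative` are hypothetical, and the input is a toy list rather than the dataset:

```python
def bin_counts(values, n_bins=10, width=10):
    """Count values per bin with a plain loop; the last bin also absorbs the top score."""
    counts = [0] * n_bins
    for v in values:
        idx = min(int(v // width), n_bins - 1)  # 100 falls into bin 9
        counts[idx] += 1
    return counts


def cumulative(counts):
    """Running total: how many values fall in this bin or any lower one."""
    total, out = 0, []
    for c in counts:
        total += c
        out.append(total)
    return out


# Hypothetical scores covering the bin edges.
toy_scores = [0, 9, 10, 55, 89, 90, 100]
counts = bin_counts(toy_scores)
cum = cumulative(counts)
mode_bin = counts.index(max(counts))  # first bin with the highest count

print(counts)    # [2, 1, 0, 0, 0, 1, 0, 0, 1, 2]
print(cum)       # [2, 3, 3, 3, 3, 4, 4, 4, 5, 7]
print(mode_bin)  # 0
```

Calling the same helpers once per subject keeps each subject's counts and cumulative totals tied to the same input.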

### Task 6: Student Consistency Analysis
Identify students who have very consistent scores across all three subjects without using variance or standard deviation functions.

**Instructions**:

1. **Calculate score range for each student**: For each student, find the difference between their highest and lowest score among the three subjects.

2. **Define consistency threshold**: Students are "consistent" if their range is 5 points or less (all three scores within 5 points of each other).

3. **Count consistent students**: Count how many students meet the consistency criteria.

4. **Find the most consistent student**: Identify the student with the smallest range (most consistent).

5. **Find the least consistent student**: Identify the student with the largest range (most variable).

6. **Create consistency categories**:
   - Very Consistent: range ≤ 5
   - Consistent: range 6-10
   - Moderate: range 11-20
   - Variable: range 21-30
   - Highly Variable: range > 30

7. **Count students in each category**: Display the distribution.

8. **Analyze patterns**: Among consistent students, which subject combinations tend to be similar?

In [30]:
# TODO: Your code here
# 1. Calculate score range for each student
max_perStudent = scores.max(axis =1)
min_perStudent = scores.min(axis =1)
score_range = max_perStudent - min_perStudent

# 2 & 3. Count consistent students 
mask = score_range <= 5
number_Consistent = int(mask.sum())
print(f'Count consistent students: {number_Consistent} students.')

# 4. Find most consistent student
min_score_range = score_range.min()
index = np.where(score_range == min_score_range)[0]
print()
for i in index:
    print(f'Student with index {i}: scores = {scores[i].tolist()}')
    
# 5. Find least consistent student
max_score_range = score_range.max()
index = np.where(score_range == max_score_range)[0]
print()
for i in index:
    print(f'Student with index {i}: scores = {scores[i].tolist()}')
    
# 6. Create consistency categories
cats = {
    "Very Consistent (≤5)":        (score_range <= 5),
    "Consistent (6–10)":          ((score_range >= 6)  & (score_range <= 10)),
    "Moderate (11–20)":           ((score_range >= 11) & (score_range <= 20)),
    "Variable (21–30)":           ((score_range >= 21) & (score_range <= 30)),
    "Highly Variable (>30)":       (score_range > 30),
}

# 7. Count students in each category
print()
for name, mask in cats.items():
    print(f'{name}: {int(mask.sum())} students')
    
# 8. Analyze patterns among consistent students
mr_diff = np.abs(scores[:, 0] - scores[:, 1])  # Math vs Reading
mw_diff = np.abs(scores[:, 0] - scores[:, 2])  # Math vs Writing
rw_diff = np.abs(scores[:, 1] - scores[:, 2])  # Reading vs Writing

diffs_all = np.stack([mr_diff, mw_diff, rw_diff], axis =1)

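# Select the consistent students explicitly: the name `mask` was reassigned
# by the loop in step 7 and now holds the last category's mask.
diffs_consistent = diffs_all[score_range <= 5]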

closest_pair_index = np.argmin(diffs_consistent, axis=1)
pair_names = np.array(["Math-Reading", "Math-Writing", "Reading-Writing"])

unique, counts = np.unique(closest_pair_index, return_counts=True)

print()
print("Among consistent students, closest subject pairs:")
for u, c in zip(unique, counts):
    print(f'{pair_names[u]}: {c}')


Count consistent students: 231 students.

Student with index 157: scores = [60.0, 60.0, 60.0]
Student with index 458: scores = [100.0, 100.0, 100.0]
Student with index 796: scores = [70.0, 70.0, 70.0]
Student with index 916: scores = [100.0, 100.0, 100.0]
Student with index 962: scores = [100.0, 100.0, 100.0]

Student with index 371: scores = [45.0, 73.0, 70.0]
Student with index 414: scores = [51.0, 72.0, 79.0]

Very Consistent (≤5): 231 students
Consistent (6–10): 365 students
Moderate (11–20): 372 students
Variable (21–30): 32 students
Highly Variable (>30): 0 students

Among consistent students, closest subject pairs:
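
A compact sketch of the range, category, and closest-pair steps on a hypothetical `toy` array. It gives the consistency mask its own name (`consistent_mask`) so that a loop variable cannot overwrite it before the pattern analysis:

```python
import numpy as np

# Hypothetical scores: columns are math, reading, writing.
toy = np.array([
    [60.0, 60.0, 60.0],  # range 0
    [70.0, 74.0, 78.0],  # range 8
    [45.0, 73.0, 70.0],  # range 28
    [85.0, 90.0, 89.0],  # range 5
])

score_range = toy.max(axis=1) - toy.min(axis=1)
consistent_mask = score_range <= 5  # dedicated name, never reused below

# Absolute gap for each subject pair, one column per pair.
pair_names = ["Math-Reading", "Math-Writing", "Reading-Writing"]
diffs = np.stack([
    np.abs(toy[:, 0] - toy[:, 1]),
    np.abs(toy[:, 0] - toy[:, 2]),
    np.abs(toy[:, 1] - toy[:, 2]),
], axis=1)

# Closest pair per consistent student; ties go to the first pair listed.
closest = np.argmin(diffs[consistent_mask], axis=1)
pair_counts = np.bincount(closest, minlength=3)

for name, c in zip(pair_names, pair_counts):
    print(f"{name}: {c}")
```

Note that a student with three identical scores is counted under the first pair, since `np.argmin` returns the first minimum. Whether that tie-breaking is acceptable depends on how the pattern question is meant to be read.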
