# Data Manipulation and Cleaning in R
This notebook teaches data manipulation and cleaning in R. It covers the following topics:
- Basic mathematical functions in R
- Filtering, subsetting, and sorting data
- Importing data from files
- Handling and cleaning missing data

Let's get started!

### Basic Mathematical Functions

In [3]:
# Creating a vector of sample data
data <- c(23, 15, 42, 57, 19, 34, 21)

# Calculating basic statistics
sum_data <- sum(data)
mean_data <- mean(data)
median_data <- median(data)
sd_data <- sd(data)

# Displaying results
cat("Sum:", sum_data, "\nMean:", mean_data, "\nMedian:", median_data, "\nStandard Deviation:", sd_data)

Sum: 211 
Mean: 30.14286 
Median: 23 
Standard Deviation: 15.08231
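
R's `sd()` computes the sample standard deviation, which divides by n - 1 rather than n. The sketch below reproduces the value above by hand, using the same sample values. The names `x`, `n`, and `manual_sd` are illustrative and not part of the original notebook.

```r
# Same sample values as in the cell above
x <- c(23, 15, 42, 57, 19, 34, 21)
n <- length(x)

# Sample standard deviation: square root of the sum of squared
# deviations from the mean, divided by n - 1
manual_sd <- sqrt(sum((x - mean(x))^2) / (n - 1))

# Compare the manual result with the built-in function
all.equal(manual_sd, sd(x))
# [1] TRUE
```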

### Filtering and Subsetting Data

In [3]:
# Example: Filtering numbers greater than 30
filtered_data <- subset(data, data > 30)
filtered_data
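
For a plain vector, the same filter can also be written with logical indexing, which extends naturally to combined conditions. This is a short sketch on the same sample values; the names `above_30` and `between_20_and_40` are illustrative.

```r
data <- c(23, 15, 42, 57, 19, 34, 21)

# Logical indexing: keep the elements where the condition is TRUE
above_30 <- data[data > 30]
above_30
# [1] 42 57 34

# Combining conditions with & (and) or | (or)
between_20_and_40 <- data[data >= 20 & data <= 40]
between_20_and_40
# [1] 23 34 21

# which() returns the positions of the matching elements
which(data > 30)
# [1] 3 4 6
```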

### Sorting Data

In [4]:
# Sorting data in ascending order
sorted_data <- sort(data)

# Sorting in descending order
sorted_data_desc <- sort(data, decreasing = TRUE)

cat("Ascending Order:", sorted_data, "\nDescending Order:", sorted_data_desc)

Ascending Order: 15 19 21 23 34 42 57 
Descending Order: 57 42 34 23 21 19 15
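
`sort()` works on a single vector. To sort the rows of a data frame by one of its columns, `order()` is the usual tool. This is a minimal sketch; the `people` data frame and its values are made up for illustration.

```r
# Hypothetical data frame, made up for illustration
people <- data.frame(
  name = c("Alice", "Bob", "Charlie", "David"),
  score = c(85, 90, 78, 88),
  stringsAsFactors = FALSE
)

# order() returns the row positions that would sort the column
order(people$score)
# [1] 3 1 4 2

# Use those positions to reorder the rows (ascending by score)
by_score <- people[order(people$score), ]

# Descending order
by_score_desc <- people[order(people$score, decreasing = TRUE), ]
by_score_desc$name
# [1] "Bob"     "David"   "Alice"   "Charlie"
```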

### Importing and Cleaning Data

In [6]:
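# Importing data from a CSV file (diabetes.csv must be in the working directory)
# Stored under a new name so the `data` vector from the earlier cells is not overwritten
diabetes <- read.csv('diabetes.csv')

# Displaying the first few rows of the data
head(diabetes)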

| | Pregnancies | Glucose | BloodPressure | SkinThickness | Insulin | BMI | DiabetesPedigreeFunction | Age | Outcome |
|---|---|---|---|---|---|---|---|---|---|
| | `<int>` | `<int>` | `<int>` | `<int>` | `<int>` | `<dbl>` | `<dbl>` | `<int>` | `<int>` |
| 1 | 6 | 148 | 72 | 35 | NA | 33.6 | 0.627 | 50 | 1 |
| 2 | 1 | 85 | 66 | 29 | NA | 26.6 | 0.351 | 31 | 0 |
| 3 | 8 | 183 | 64 | NA | NA | 23.3 | 0.672 | 32 | 1 |
| 4 | 1 | 89 | 66 | 23 | 94 | 28.1 | 0.167 | 21 | 0 |
| 5 | NA | 137 | 40 | 35 | 168 | 43.1 | 2.288 | 33 | 1 |
| 6 | 5 | 116 | 74 | NA | NA | 25.6 | 0.201 | 30 | 0 |
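
After importing, it helps to check how many values are missing in each column before cleaning. The sketch below writes a small made-up CSV to a temporary file so that it runs on its own. The column names, the values, and the names `csv_path` and `sample_df` are assumptions for illustration, not taken from `diabetes.csv`.

```r
# Write a tiny CSV to a temporary file so the example is self-contained
# (the column names and values are made up)
csv_path <- tempfile(fileext = ".csv")
writeLines(c(
  "id,glucose,insulin,group",
  "1,148,,a",
  "2,85,94,",
  "3,183,NA,b"
), csv_path)

# na.strings lists the text values that should be read as NA;
# including "" also marks empty character fields as missing
sample_df <- read.csv(csv_path, na.strings = c("", "NA"),
                      stringsAsFactors = FALSE)

# Inspect the column types, then count missing values per column
str(sample_df)
colSums(is.na(sample_df))
```

Here `insulin` has two missing values and `group` has one, while `id` and `glucose` are complete.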


### Handling Missing Data

In [8]:
# Creating a sample data frame with missing values
data_with_na <- data.frame(
  name = c("Alice", "Bob", "Charlie", "David"),
  age = c(25, NA, 30, NA),
  score = c(85, 90, NA, 88)
)
data_with_na

| name | age | score |
|---|---|---|
| `<chr>` | `<dbl>` | `<dbl>` |
| Alice | 25 | 85 |
| Bob | NA | 90 |
| Charlie | 30 | NA |
| David | NA | 88 |


In [9]:
# Removing rows with missing values
data_cleaned <- na.omit(data_with_na)
data_cleaned

| | name | age | score |
|---|---|---|---|
| | `<chr>` | `<dbl>` | `<dbl>` |
| 1 | Alice | 25 | 85 |
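
`na.omit()` drops every row that has a missing value in any column, which here leaves only one of four rows. When only some columns matter for the analysis, a narrower filter keeps more data. This sketch rebuilds the same sample data frame; the name `has_score` is illustrative.

```r
# Same sample data frame as above
data_with_na <- data.frame(
  name = c("Alice", "Bob", "Charlie", "David"),
  age = c(25, NA, 30, NA),
  score = c(85, 90, NA, 88),
  stringsAsFactors = FALSE
)

# complete.cases() flags the rows with no missing values in any column
complete.cases(data_with_na)
# [1]  TRUE FALSE FALSE FALSE

# Keep every row that has a score, even if age is missing
has_score <- data_with_na[!is.na(data_with_na$score), ]
has_score$name
# [1] "Alice" "Bob"   "David"
```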


### Replacing Missing Values

In [10]:
# Replacing NA values in the age column with the mean of that column
# (only age is imputed here; the missing score is left as NA)
data_with_na$age[is.na(data_with_na$age)] <- mean(data_with_na$age, na.rm = TRUE)
data_with_na

| name | age | score |
|---|---|---|
| `<chr>` | `<dbl>` | `<dbl>` |
| Alice | 25.0 | 85 |
| Bob | 27.5 | 90 |
| Charlie | 30.0 | NA |
| David | 27.5 | 88 |
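
The cell above imputes only the `age` column, so Charlie's `score` is still missing. One way to apply the same idea to every numeric column is sketched below. The helper `impute_mean` and the variable `numeric_cols` are hypothetical names introduced for this example.

```r
# Same sample data frame as above, before any imputation
data_with_na <- data.frame(
  name = c("Alice", "Bob", "Charlie", "David"),
  age = c(25, NA, 30, NA),
  score = c(85, 90, NA, 88),
  stringsAsFactors = FALSE
)

# Hypothetical helper: replace the NAs in one numeric vector
# with the mean of its non-missing values
impute_mean <- function(x) {
  x[is.na(x)] <- mean(x, na.rm = TRUE)
  x
}

# Apply the helper to every numeric column, leaving other columns untouched
numeric_cols <- sapply(data_with_na, is.numeric)
data_with_na[numeric_cols] <- lapply(data_with_na[numeric_cols], impute_mean)

data_with_na$age
# [1] 25.0 27.5 30.0 27.5

# No missing values remain
anyNA(data_with_na)
# [1] FALSE
```

Mean imputation is simple but pulls the column's spread toward the center. Swapping `mean()` for `median()` in the helper is a common alternative when a column contains outliers.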
