## Lesson 1: Introduction to R, RStudio, and Data Import

1. Introduction to R and RStudio
   RStudio is an Integrated Development Environment (IDE) for R.
   It provides a console, script editor, environment pane, and plots pane.


## Variables and Assignment in R

In R, we can store values in variables using assignment operators. There are two main ways to assign values:

1. **`<-`** (preferred): This is the traditional R assignment operator
2. **`=`**: This also works for assignment (similar to other programming languages)

**Why use `<-`?** It is the convention in most R style guides, and it keeps assignment visually distinct from `=`, which R also uses to pass named arguments in function calls (for example, `mean(x = values)`).

Let's see how to create variables:

In [9]:
# 2. Basic R Syntax and Data Types

# Variables
x <- 10
y = 5 # another way to assign


## Displaying Values with print()

The `print()` function displays the value of variables in the console. This is useful for:
- Checking what's stored in a variable
- Displaying results
- Debugging your code

**Note:** In an interactive R session you can also type a variable's name to see its value. This auto-printing only happens at the top level, so inside loops and functions, or in a script run with `source()`, you need an explicit `print()`.

In [10]:
print(x)
print(y)

[1] 10
[1] 5
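
As a small illustration of why `print()` matters outside the console, the sketch below (using made-up loop values) shows that a bare expression inside a `for` loop displays nothing, while `print()` displays each value. `capture.output()` is used only to count the lines that would have been shown.

```r
# A bare value inside a loop is evaluated but not displayed
for (i in 1:3) {
  i
}

# Wrapping the value in print() displays it on every pass
for (i in 1:3) {
  print(i)
}

# capture.output() collects whatever would have been printed
silent_output <- capture.output(for (i in 1:3) i)
shown_output  <- capture.output(for (i in 1:3) print(i))
length(silent_output)  # 0 lines
length(shown_output)   # 3 lines
```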


## Arithmetic Operations in R

R can perform mathematical calculations just like a calculator. The basic arithmetic operators are:

- **`+`**: Addition
- **`-`**: Subtraction  
- **`*`**: Multiplication
- **`/`**: Division
- **`^` or `**`**: Exponentiation (power)
- **`%%`**: Modulo (remainder after division)

We can store the results of these operations in new variables for later use:

In [11]:
# Basic arithmetic operations
sum_xy <- x + y
diff_xy <- x - y
prod_xy <- x * y
div_xy <- x / y
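
The cell above covers four of the listed operators. A minimal sketch of the remaining ones follows, using illustrative values `a` and `b` that are not part of the lesson's data. The integer-division operator `%/%` does not appear in the list above but is a standard R operator and pairs naturally with `%%`.

```r
a <- 17
b <- 5

power_ab     <- a ^ 2    # 17 squared: 289
remainder_ab <- a %% b   # remainder of 17 / 5: 2
quotient_ab  <- a %/% b  # whole-number part of 17 / 5: 3

# Standard operator precedence applies; parentheses override it
no_parens   <- 2 + 3 * 4    # 14
with_parens <- (2 + 3) * 4  # 20
```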

## Combining Text and Numbers with paste()

The `paste()` function is very useful for combining (concatenating) text and numbers into readable messages.

**Syntax:** `paste("text", variable, "more text")`

This is particularly helpful when you want to display results with descriptive labels, making your output more informative and professional-looking.

In [12]:
print(paste("Sum:", sum_xy))
print(paste("Difference:", diff_xy))
print(paste("Product:", prod_xy))
print(paste("Division:", div_xy))

[1] "Sum: 15"
[1] "Difference: 5"
[1] "Product: 50"
[1] "Division: 2"
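
By default `paste()` puts a single space between its pieces. The sketch below, with an illustrative `total` value, shows two common variations: the `sep` argument to change the separator, and `paste0()` to join with no separator at all.

```r
total <- 15

default_label <- paste("Sum:", total)              # "Sum: 15"
custom_sep    <- paste("Sum", total, sep = " = ")  # "Sum = 15"

# paste0() joins without a separator and works element by element on vectors
item_ids <- paste0("item_", 1:3)  # "item_1" "item_2" "item_3"

print(default_label)
print(custom_sep)
print(item_ids)
```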


## Vectors: R's Basic Data Structure

**Vectors** are the most fundamental data structure in R. Think of a vector as an ordered sequence of items that all share the same type.

**Key points about vectors:**
- All elements must be the same data type (all numbers, all text, etc.)
- Created using the `c()` function (c stands for "combine" or "concatenate")
- Can contain numbers, text (characters), logical values (TRUE/FALSE), etc.
- Very powerful for data analysis - you can perform operations on entire vectors at once!

**Common vector types:**
- **Numeric**: `c(1, 2, 3.5, 4.2)` 
- **Character**: `c("apple", "banana", "orange")`
- **Logical**: `c(TRUE, FALSE, TRUE)`

In [13]:
# Vectors (1D array of same data type)
numbers <- c(1, 2, 3, 4, 5)
fruits <- c("apple", "banana", "orange")

## Printing Vectors

Let's see what our vectors look like when we print them. Notice how R displays vectors with numbers in brackets `[1]` - this shows the position/index of the first element in each line.

In [14]:
print(numbers)
print(fruits)

[1] 1 2 3 4 5
[1] "apple"  "banana" "orange"
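
Two of the key points above can be seen directly: operations apply to every element at once, and mixing types forces R to convert everything to one common type. The following is a small sketch with made-up vectors (`scores` and `mixed` are illustrative names).

```r
scores <- c(10, 20, 30)

# Arithmetic is applied to every element at once
doubled <- scores * 2          # 20 40 60

# Square brackets pick out elements; R counts positions from 1
second_score <- scores[2]      # 20
high_scores  <- scores[scores > 15]  # 20 30

# Mixing types in c() converts everything to a single type
mixed <- c(1, "two", TRUE)
class(mixed)                   # "character"
print(mixed)
```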


## Factors: For Categorical Data

**Factors** are R's way of handling categorical data - data that falls into specific categories or groups.

**When to use factors:**
- Colors (red, green, blue)
- Survey responses (Agree, Neutral, Disagree) 
- Categories (Small, Medium, Large)
- Gender (Male, Female, Other)

**Why use factors instead of regular text?**
- R knows these are categories, not just random text
- Stored compactly as integer codes with one text label per category
- Enables proper statistical analysis
- Controls the order of categories (useful for charts and analysis)

**Note:** When you print a factor, R shows you the values AND the "Levels" (all possible categories).

In [15]:
# Factors (for categorical data)
colors <- factor(c("red", "green", "blue", "red"))
print(colors)

[1] red   green blue  red  
Levels: blue green red
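
The output above lists the levels alphabetically, which is R's default. When the categories have a natural order, the `levels` argument of `factor()` sets it explicitly. A minimal sketch, using a made-up `sizes` vector:

```r
# Set the level order explicitly instead of relying on alphabetical order
sizes <- factor(
  c("Medium", "Small", "Large", "Small"),
  levels = c("Small", "Medium", "Large")
)

levels(sizes)       # "Small" "Medium" "Large"
as.integer(sizes)   # 2 1 3 1 (the integer code behind each value)

# table() counts observations per level, in level order
size_counts <- table(sizes)
print(size_counts)
```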


## Lists: Mixed Data Types

**Lists** are like containers that can hold different types of data together. Unlike vectors, lists are very flexible!

**What makes lists special:**
- Can contain different data types in the same structure
- Can hold numbers, text, vectors, even other lists!
- Elements can be named (like in our example: "Name", "Age", "Numbers")
- Very useful for organizing complex data

**Think of a list like a file folder:** 
- Each item has a label (the name)
- Each item can contain different types of content
- You can access items by their name or position

This makes lists perfect for storing related information that might be different types (like a person's name, age, and favorite numbers).

In [16]:
# Lists (can contain different data types)
my_list <- list("Name" = "Alice", "Age" = 30, "Numbers" = c(1, 2, 3))
print(my_list)

$Name
[1] "Alice"

$Age
[1] 30

$Numbers
[1] 1 2 3
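
To make "access items by their name or position" concrete, here is a short sketch with a list shaped like the one above (`person` is an illustrative name). Note the difference between double brackets, which return the item itself, and single brackets, which return a smaller list.

```r
person <- list(Name = "Alice", Age = 30, Numbers = c(1, 2, 3))

# Access by name
person_name <- person$Name       # "Alice"
person_age  <- person[["Age"]]   # 30

# Access by position, then index into the vector stored there
second_number <- person[[3]][2]  # 2

# Single brackets keep the result wrapped in a list
age_as_list <- person["Age"]
class(age_as_list)               # "list"
```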



## Data Frames: The Heart of Data Analysis

**Data Frames** are the most important data structure for data analysis in R. Think of them as spreadsheets or database tables!

**Key characteristics:**
- **Rows**: Each row represents an observation (like a person, transaction, measurement)
- **Columns**: Each column represents a variable (like name, age, salary)
- **Mixed data types**: Different columns can contain different types of data
- **Same length**: All columns must have the same number of rows

**Why data frames are so important:**
- Most real-world data looks like this (think Excel spreadsheets, CSV files)
- Perfect for statistical analysis and visualization
- Many modeling and plotting functions (such as `lm()` and `ggplot()`) take a data frame as input
- Easy to filter, sort, and manipulate

**In our example:**
- Each row represents one person
- Columns represent different attributes (ID, Name, Age)

In [17]:
# Data Frames (like a table or spreadsheet)
data_frame_example <- data.frame(
  ID = c(1, 2, 3),
  Name = c("John", "Jane", "Mike"),
  Age = c(25, 30, 35)
)
print(data_frame_example)

  ID Name Age
1  1 John  25
2  2 Jane  30
3  3 Mike  35
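
Once a data frame exists, a few base R operations cover most everyday needs: pulling out a column, checking its size, and keeping only rows that meet a condition. The sketch below rebuilds a similar table under the illustrative name `people`.

```r
people <- data.frame(
  ID   = c(1, 2, 3),
  Name = c("John", "Jane", "Mike"),
  Age  = c(25, 30, 35)
)

# A column is a vector, so vector functions work on it
ages        <- people$Age        # 25 30 35
average_age <- mean(people$Age)  # 30

# Dimensions
n_rows <- nrow(people)  # 3
n_cols <- ncol(people)  # 3

# Keep rows where the condition is TRUE; the empty slot after the comma means "all columns"
older <- people[people$Age > 26, ]
print(older)
```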


## Installing Packages (One-Time Setup)

**What are packages?** Think of packages as "apps" for R - they add new functions and capabilities.

**The `install.packages()` function:**
- Downloads and installs packages from the internet
- You only need to run this ONCE per R installation, not in every session
- That's why we've commented these out (they're already installed in this environment)

**Key packages we're using:**
- **`tidyverse`**: A collection of packages for data science (data cleaning, visualization, etc.)
- **`readxl`**: Specifically for reading Excel files (.xlsx, .xls)

**Important note:** If you're on your own computer and these packages aren't installed, you would uncomment these lines and run them once.

In [None]:
# 3. Installing and Loading Packages
# Packages extend R's functionality.
# The 'tidyverse' is a collection of packages for data science.

# Install if not already installed (run only once)
# install.packages("tidyverse")
# install.packages("readxl") # for reading Excel files
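
One way to avoid reinstalling packages that are already present is to check first. The sketch below is one possible pattern, not the only one: `requireNamespace()` reports whether a package can be loaded, and the install line is left commented out so that running the cell never downloads anything unexpectedly.

```r
needed <- c("tidyverse", "readxl")

# TRUE for each package that is already installed
is_installed <- vapply(needed, requireNamespace, logical(1), quietly = TRUE)
missing_pkgs <- needed[!is_installed]

if (length(missing_pkgs) > 0) {
  message("Not installed: ", paste(missing_pkgs, collapse = ", "))
  # install.packages(missing_pkgs)  # uncomment to install what is missing
} else {
  message("All required packages are installed.")
}
```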



## Loading Packages with library()

Now we'll load the packages we need. Think of this like opening apps on your phone - the packages are installed, but we need to "open" them to use their functions.

**About our packages:**

**`tidyverse`** - A collection of packages that work well together for data science:
- Makes data cleaning and analysis much easier
- Includes tools for reading data, manipulating data, and creating visualizations
- Very popular in the R community

**`readxl`** - Specifically designed for reading Excel files:
- Can read both .xls and .xlsx files
- Handles Excel's complexities (multiple sheets, formatting, etc.)

**Important:** You only need to install packages once (`install.packages()`), but you need to load them with `library()` every time you start a new R session.
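
A related notation is `package::function()`, which calls a single function from an installed package without loading the whole package with `library()`. It is also a way to say exactly which package a function should come from when two packages use the same name. The sketch below uses `stats` and `utils`, which ship with R, so it runs without installing anything.

```r
# Call a function from a named package without attaching the package
middle_value <- stats::median(c(3, 1, 2))  # 2
first_three  <- utils::head(letters, 3)    # "a" "b" "c"

print(middle_value)
print(first_three)
```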

In [18]:
# Load packages for use in the current session
library(tidyverse)
library(readxl)

── Attaching core tidyverse packages ──────────────────────── tidyverse 2.0.0 ──
✔ dplyr     1.1.4     ✔ readr     2.1.5
✔ forcats   1.0.0     ✔ stringr   1.5.1
✔ ggplot2   3.5.2     ✔ tibble    3.3.0
✔ lubridate 1.9.4     ✔ tidyr     1.3.1
✔ purrr     1.1.0     
── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
✖ dplyr::filter() masks stats::filter()
✖ dplyr::lag()    masks stats::lag()
ℹ Use the conflicted package (<http://conflicted.r-lib.org/>) to force all conflicts to become errors


## Understanding Working Directory

The **working directory** is like R's "current location" on your computer. It's the folder where R will:
- Look for files when you try to read them
- Save files when you create them

**Key functions:**
- `getwd()`: Shows you where R is currently "looking"
- `setwd()`: Changes R's current location (use carefully!)

**Why this matters:**
- If your data file is in a different folder, R won't find it
- Understanding file paths prevents many common errors
- Good practice: organize your projects in folders

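**Best practice:** Instead of changing the working directory inside a script, keep each project in its own folder (for example, an RStudio Project) and refer to files with paths relative to that folder.

**Note about the `setwd()` call:** The path used in the next cell is specific to this course's environment. Treat `setwd()` with care because:
1. Everyone's computer has a different folder structure, so a hard-coded path will fail elsewhere
2. Relative paths inside a well-organized project are more portable
3. Changing the working directory mid-script can cause confusion about where files are read and written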

In [20]:
# 4. Setting up Working Directory
# The working directory is where R will look for files and save files.

# Get current working directory
getwd()

# Set working directory (replace with your desired path)
setwd("/workspaces/Data-Management-Assignment-1-Intro-to-R/data")
# Verify the new working directory
getwd()
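
As an alternative to `setwd()`, paths can be built relative to the current working directory and checked before use. The following is a minimal sketch; the `data/sales.csv` location is a hypothetical example, not a file from this lesson.

```r
# Build a path from its parts; file.path() inserts "/" between them
csv_path <- file.path("data", "sales.csv")  # hypothetical file
print(csv_path)                             # "data/sales.csv"

# Check before reading instead of letting the import fail
if (file.exists(csv_path)) {
  message("Found: ", csv_path)
} else {
  message("Not found relative to: ", getwd())
}
```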

## Creating Sample Data and Reading CSV Files

For this lesson, we'll create a small sample CSV file to practice with. In real projects, you'd usually work with existing data files.

**What we're doing:**
1. Creating sample data as text (with headers and comma-separated values)
2. Writing it to a CSV file using `writeLines()`
3. Reading it back using `read_csv()` from the tidyverse

**About CSV files:**
- **CSV** = Comma-Separated Values
- Very common format for data exchange
- Can be opened in Excel, R, Python, databases, etc.
- Simple text format that's human-readable

**`read_csv()` vs base R's `read.csv()`:**
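- `read_csv()` (from `readr`, part of the tidyverse) is typically faster on large files
- Reports the column types it guessed, so import problems are easier to spot
- Returns a tibble (an enhanced data frame) rather than a base `data.frame`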
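
When the column types are known in advance, they can be declared instead of guessed. The sketch below assumes the `readr` package is installed and uses a temporary file with made-up rows so that it does not touch the lesson's files. `cols()`, `col_integer()`, `col_character()`, and `col_double()` are `readr` functions.

```r
library(readr)

# Write a tiny CSV to a temporary file so the example is self-contained
tmp_csv <- tempfile(fileext = ".csv")
writeLines(c("ID,Product,Sales", "1,Laptop,1200", "2,Mouse,25"), tmp_csv)

# Declare the column types up front instead of letting read_csv() guess
sales <- read_csv(
  tmp_csv,
  col_types = cols(
    ID      = col_integer(),
    Product = col_character(),
    Sales   = col_double()
  )
)

class(sales$ID)     # "integer"
class(sales$Sales)  # "numeric"

# Remove the temporary file
file.remove(tmp_csv)
```

Declaring the types also suppresses the column specification message that `read_csv()` prints when it has to guess.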

In [21]:
# 5. Importing Data from CSV and Excel files

# Create a dummy CSV file for demonstration
dummy_csv_data <- "ID,Product,Sales
1,Laptop,1200
2,Mouse,25
3,Keyboard,75
4,Monitor,300"
writeLines(dummy_csv_data, "dummy_sales_data.csv")

In [22]:
# Import CSV file using readr (part of tidyverse)
sales_data_csv <- read_csv("dummy_sales_data.csv")
print(sales_data_csv)


Rows: 4 Columns: 3
── Column specification ────────────────────────────────────────────────────────
Delimiter: ","
chr (1): Product
dbl (2): ID, Sales

ℹ Use `spec()` to retrieve the full column specification for this data.
ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.


# A tibble: 4 × 3
     ID Product  Sales
  <dbl> <chr>    <dbl>
1     1 Laptop    1200
2     2 Mouse       25
3     3 Keyboard    75
4     4 Monitor    300


## Working with Excel Files (Commented Code Explanation)

**Excel files (.xlsx, .xls)** are very common in business and research. The `readxl` package makes it easy to read them into R.

**Why the code is commented out:**
The next few cells show examples of working with Excel files, but they're commented out because:
1. We'd need to install additional packages (`openxlsx`) to create Excel files
2. For this demo, we're focusing on CSV files which are simpler
3. In real projects, you'd typically have Excel files provided to you

**Key functions for Excel:**
- `read_excel("filename.xlsx")`: Reads Excel files
- `write.xlsx()`: Creates Excel files (requires `openxlsx` package)
- Excel files can have multiple sheets, which `readxl` can handle

**When you'd use this:** In real projects when you receive data in Excel format from colleagues or clients.

In [None]:
# Create a dummy Excel file (writing requires a package such as openxlsx; readxl can only read)
# For simplicity, we'll assume an Excel file exists or is created manually for the lesson.
# If you need to create one programmatically for testing, you'd use a package like 'openxlsx'.
# Example of creating a dummy Excel file (requires 'openxlsx' package)
# install.packages("openxlsx")
# library(openxlsx)
# dummy_excel_data <- data.frame(
#   OrderID = c(101, 102, 103),
#   Customer = c("Alice", "Bob", "Charlie"),
#   Amount = c(150.50, 200.00, 75.25)
# )
# write.xlsx(dummy_excel_data, "dummy_orders_data.xlsx")

In [None]:
# For the purpose of this example, let's simulate reading an Excel file
# Assume 'dummy_orders_data.xlsx' exists in the working directory
# orders_data_excel <- read_excel("dummy_orders_data.xlsx")
# print(orders_data_excel)
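
For a runnable alternative that needs no file of your own, `readxl` ships with sample workbooks. The sketch below assumes `readxl` is installed and uses its bundled `datasets.xlsx` file, located with `readxl_example()`, to show how to list sheets and read one of them by name.

```r
library(readxl)

# readxl_example() returns the path to a sample workbook bundled with readxl
xlsx_path <- readxl_example("datasets.xlsx")

# List the worksheet names, then read one sheet by name
sheet_names <- excel_sheets(xlsx_path)
print(sheet_names)

mtcars_sheet <- read_excel(xlsx_path, sheet = "mtcars")
head(mtcars_sheet, n = 3)
```

The same `sheet` argument also accepts a position (for example `sheet = 2`) when the sheet names are not known in advance.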


## Data Inspection: Your First Look at Data

Once you've loaded data, the first thing you should do is **inspect** it. This helps you understand:
- What variables (columns) you have
- How much data you're working with  
- What the data looks like
- If there are any obvious problems

**Key inspection functions we'll use:**
- `head()`: Shows first few rows
- `str()`: Shows structure and data types  
- `summary()`: Shows statistical summaries
- `View()`: Opens data in spreadsheet viewer (RStudio)

Let's explore our sales data step by step:

In [23]:
# 6. Basic Data Inspection

# View the first few rows
head(sales_data_csv)


# A tibble: 4 × 3
     ID Product  Sales
  <dbl> <chr>    <dbl>
1     1 Laptop    1200
2     2 Mouse       25
3     3 Keyboard    75
4     4 Monitor    300


## The head() Function: Quick Data Preview

`head()` shows you the first few rows of your data (default is 6 rows). This is super useful because:

- **Quick overview**: See what your data looks like without overwhelming output
- **Column names**: Check if column names imported correctly  
- **Data types**: Get a sense of what type of data is in each column
- **Sample values**: See examples of actual data values

**Tip:** You can specify how many rows to show: `head(data, n = 10)` shows first 10 rows.

This is often the first thing data analysts do when they load new data!
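
The `n` argument is easiest to see on a table with more than six rows. The sketch below uses a made-up eight-row data frame (`sales` is an illustrative name) and also shows `tail()`, the counterpart of `head()` for the last rows.

```r
sales <- data.frame(
  ID    = 1:8,
  Sales = c(1200, 25, 75, 300, 45, 60, 980, 15)
)

first_default <- head(sales)         # first 6 rows by default
first_three   <- head(sales, n = 3)  # first 3 rows
last_two      <- tail(sales, n = 2)  # last 2 rows

nrow(first_default)  # 6
print(first_three)
print(last_two)
```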

In [24]:
# Get structure of the data frame
str(sales_data_csv)

spc_tbl_ [4 × 3] (S3: spec_tbl_df/tbl_df/tbl/data.frame)
 $ ID     : num [1:4] 1 2 3 4
 $ Product: chr [1:4] "Laptop" "Mouse" "Keyboard" "Monitor"
 $ Sales  : num [1:4] 1200 25 75 300
 - attr(*, "spec")=
  .. cols(
  ..   ID = [32mcol_double()[39m,
  ..   Product = [31mcol_character()[39m,
  ..   Sales = [32mcol_double()[39m
  .. )
 - attr(*, "problems")=<externalptr> 


## The str() Function: Understanding Data Structure

`str()` (structure) gives you a comprehensive overview of your data's "anatomy":

**What str() tells you:**
- **Data type**: Is it a data frame, list, vector, etc.?
- **Dimensions**: How many rows and columns (observations and variables)?
- **Column types**: Are columns numeric, character, factor, etc.?
- **Sample values**: A few example values from each column

**Why this matters:**
- **Data types affect analysis**: You can't do math on text columns
- **Size at a glance**: The row and column counts show how large the data is (for actual memory use, see `object.size()`)
- **Data problems**: Quickly spot if numbers were read as text, etc.

**Example output interpretation:**
- `num`: Numeric (numbers you can do math with)
- `chr`: Character (text)
- `Factor`: Categorical data
- `logi`: Logical (TRUE/FALSE)
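
A common problem that `str()` reveals is a numeric column that was read as text. The sketch below builds such a case with made-up values (`raw_sales` is an illustrative name), shows the `chr` type in the structure, and converts the column with `as.numeric()`.

```r
# Sales was entered as text, so it is stored as character
raw_sales <- data.frame(
  Product = c("Laptop", "Mouse"),
  Sales   = c("1200", "25"),
  stringsAsFactors = FALSE
)

str(raw_sales)  # Sales shows up as chr

# Convert the text to numbers so arithmetic works
raw_sales$Sales <- as.numeric(raw_sales$Sales)

column_classes <- sapply(raw_sales, class)
print(column_classes)

total_sales <- sum(raw_sales$Sales)  # 1225
```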

In [25]:
# Get summary statistics
summary(sales_data_csv)

       ID         Product              Sales       
 Min.   :1.00   Length:4           Min.   :  25.0  
 1st Qu.:1.75   Class :character   1st Qu.:  62.5  
 Median :2.50   Mode  :character   Median : 187.5  
 Mean   :2.50                      Mean   : 400.0  
 3rd Qu.:3.25                      3rd Qu.: 525.0  
 Max.   :4.00                      Max.   :1200.0  

## The summary() Function: Statistical Overview

`summary()` provides statistical summaries for each column in your data:

**For numeric columns, you get:**
- **Min**: Smallest value
- **1st Qu**: First quartile (25th percentile)  
- **Median**: Middle value (50th percentile)
- **Mean**: Average value
- **3rd Qu**: Third quartile (75th percentile)
- **Max**: Largest value

**For character columns** (factor columns show a count per level instead):
- **Length**: How many values
- **Class**: What type of data
- **Mode**: Storage mode

**Why this is valuable:**
- **Spot outliers**: Extreme min/max values that might be errors
- **Understand distribution**: Is data skewed? Are there unusual patterns?
- **Missing data**: Reports the number of `NA` (missing) values in numeric, logical, and factor columns
- **Data quality**: Quick check for reasonable ranges

**Example interpretation:** If you're looking at sales data and the minimum is negative, that might indicate a data entry error!
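
Missing values deserve a quick check of their own, because most summary functions return `NA` as soon as one value is missing. A minimal sketch with a made-up vector (`sales_values` is an illustrative name):

```r
sales_values <- c(1200, 25, NA, 300)

# summary() adds an NA's entry when values are missing
summary(sales_values)

# Count the missing values directly
n_missing <- sum(is.na(sales_values))  # 1

# mean() returns NA unless missing values are explicitly dropped
mean_with_na    <- mean(sales_values)                # NA
mean_without_na <- mean(sales_values, na.rm = TRUE)  # about 508.33
```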

In [None]:
# View the entire dataset in a spreadsheet-like viewer (RStudio specific)
# View(sales_data_csv)

## The View() Function (Interactive Data Exploration)

**View() function:**
- Opens your data in a spreadsheet-like viewer (in RStudio, a separate data tab)
- Great for exploring data interactively
- In RStudio you can sort columns, filter, and browse through all your data
- **Note:** We've commented this out because it needs an interactive session with a graphical viewer such as RStudio, and it does not work in notebooks

**When to use View():**
- When you want to see your entire dataset
- To check for data quality issues visually
- To get familiar with large datasets
- To verify that data import worked correctly

In [None]:
# Clean up dummy file
file.remove("dummy_sales_data.csv")

## File Cleanup: Good Housekeeping

**File cleanup with file.remove():**
Since we created a temporary demo file for this lesson, we'll clean it up at the end. This is good practice:
- Keeps your workspace tidy
- Prevents confusion between demo files and real data
- Shows how to programmatically manage files in R

**In real projects:** You usually won't delete your data files - they're valuable! But removing temporary or intermediate files helps keep projects organized.

**The `file.remove()` function:**
- Deletes files from your computer
- Returns `TRUE` if the file was deleted and `FALSE` (with a warning) if it could not be, for example because the file doesn't exist
- Use with caution - deleted files can't be easily recovered!
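
A cautious pattern is to confirm that a file exists before removing it, which avoids the warning for a missing file. The sketch below works on a throwaway file created with `tempfile()`, so nothing from the lesson is deleted.

```r
# Create a throwaway file in R's temporary directory
tmp_file <- tempfile(fileext = ".csv")
writeLines(c("ID,Sales", "1,100"), tmp_file)

existed_before <- file.exists(tmp_file)  # TRUE

# Remove the file only if it is actually there
if (file.exists(tmp_file)) {
  removed <- file.remove(tmp_file)       # TRUE on success
}

exists_after <- file.exists(tmp_file)    # FALSE
```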

## Lesson 1 Summary: What You've Learned

Congratulations! You've completed your first R lesson. Here's what you now know:

### 📝 **Basic R Concepts**
- How to create variables using `<-` and `=`
- Basic arithmetic operations (`+`, `-`, `*`, `/`)
- How to display results with `print()` and `paste()`

### 📊 **Data Structures**
- **Vectors**: Collections of the same data type `c(1, 2, 3)`
- **Factors**: For categorical data `factor(c("red", "green", "blue"))`
- **Lists**: Can hold mixed data types with names
- **Data Frames**: The foundation of data analysis (like spreadsheets)

### 📦 **Working with Packages**
- How to load packages with `library()`
- Introduction to `tidyverse` (data science toolkit)
- Using `readxl` for Excel files

### 📁 **File Management**
- Understanding working directories with `getwd()`
- Reading CSV files with `read_csv()`
- Basic file operations

### 🔍 **Data Inspection**
- `head()`: Preview first few rows
- `str()`: Understand data structure and types
- `summary()`: Get statistical summaries
- `View()`: Interactive data exploration (RStudio)

### 🎯 **Next Steps**
Now you're ready to:
- Work with real datasets
- Learn data manipulation and visualization
- Perform statistical analyses
- Build on these foundational concepts

**Practice tip:** Try these functions with different datasets to reinforce your learning!