# Lesson 1: Introduction to R and Data Import

Welcome to your first lesson in R programming! This lesson is designed for **complete beginners** with no prior coding experience.

---

## 🎯 What You'll Learn Today

By the end of this lesson, you will be able to:
- Understand what programming and R are
- Create and use variables to store information
- Perform basic calculations
- Work with different types of data (numbers, text, lists)
- Load data from files for analysis

---

## 🤔 What is Programming?

**Programming** is simply giving instructions to a computer. Think of it like writing a recipe:
- A recipe tells a person step-by-step how to cook a dish
- A program tells a computer step-by-step how to complete a task

**Why learn programming for data analysis?**
- Automate repetitive tasks (analyze 1000 files the same way you'd analyze 1)
- Handle large datasets that Excel can't manage
- Create reproducible analyses (you can re-run your exact steps)
- Perform advanced statistical analysis and visualization

---

## 💻 What is R?

**R** is a programming language designed specifically for:
- Statistical analysis
- Data visualization (charts, graphs)
- Data manipulation and cleaning

**Why R?**
- Free and open-source
- Huge community and thousands of add-on packages
- Industry standard in many fields (academia, healthcare, finance)
- Excellent for beginners in data analysis

---

## 📓 How to Use This Notebook

This is a **Jupyter Notebook** - an interactive document where you can:

1. **Read explanations** (like this text)
2. **Run code** by clicking a cell and pressing `Shift + Enter`
3. **See results** immediately below each code cell

**🔑 Key tip:** Run each code cell in order, from top to bottom. Later cells often depend on earlier ones!

---

## 🔧 Understanding Code Cells

In the code cells below, you'll see lines starting with `#`. These are called **comments**.

```r
# This is a comment - R ignores it completely
x = 10  # Comments can also go at the end of a line
```

**Why use comments?**
- Explain what your code does (for yourself and others)
- Leave notes and reminders
- Temporarily disable code without deleting it

Comments are for humans, not the computer!

---

## 📦 Variables: Storing Information

### What is a Variable?

A **variable** is like a labeled container that holds information. Just like you might label a box "Kitchen Supplies" to remember what's inside, variables have names that help you remember what data they contain.

**Real-world analogy:**
| Real World | Programming |
|------------|-------------|
| A jar labeled "Sugar" containing sugar | A variable named `sugar` containing the value `2` (cups) |
| A folder labeled "Receipts" with papers inside | A variable named `receipts` containing data |

### Creating Variables in R

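We use the `=` sign to **assign** a value to a variable name. (Much of the R code you'll see elsewhere uses `<-` instead; for simple assignments like these, the two do the same thing.)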

```r
variable_name = value
```

**Important naming rules:**
- ✅ Start with a letter: `my_data`, `sales2024`
- ✅ Can contain letters, numbers, underscores: `total_count`, `item1`
- ❌ Cannot start with a number: `2sales` won't work
- ❌ No spaces allowed: `my data` won't work (use `my_data` instead)
- ❌ Avoid special characters: `my$data` won't work

**💡 Tip:** Use descriptive names! `customer_age` is better than `x`

Let's create our first variables:

```r
# Creating variables - run this cell with Shift+Enter!

# Store a number in a variable called 'x'
x = 10

# Store another number in a variable called 'y'
y = 5

# Store text (called a "string") - notice the quotation marks!
greeting = "Hello, welcome to R!"

# Store a decimal number
price = 19.99
```

### ❓ Did anything happen?

When you run the cell above, you might notice... nothing appears! That's normal. 

**Why?** Creating a variable stores the value silently. To *see* what's stored, we need to explicitly ask R to show us.

---

## 👁️ Displaying Values with print()

The `print()` function tells R: "Show me this value!"

**Syntax:** `print(what_you_want_to_see)`

This is useful for:
- ✅ Checking what's stored in a variable
- ✅ Displaying results of calculations
- ✅ Debugging (finding problems in your code)

**💡 Shortcut:** In interactive mode, just typing the variable name also shows its value, but `print()` is clearer and works everywhere.

```r
# Display the values we stored earlier
print(x)
print(y)
print(greeting)
print(price)

# You can also just type the variable name (try it!)
x
```

### ✏️ Try It Yourself!

In the cell below, practice creating your own variables:
1. Create a variable called `my_age` with your age
2. Create a variable called `my_name` with your name (remember the quotes!)
3. Use `print()` to display both values

```r
# YOUR TURN! Create your own variables here:

# 1. Create a variable for your age


# 2. Create a variable for your name (use quotes around text)


# 3. Print both values

```

---

## 🧮 R as a Calculator: Arithmetic Operations

One of the simplest things you can do with R is math! R works just like a calculator.

### Basic Math Operators

| Operator | Meaning | Example | Result |
|----------|---------|---------|--------|
| `+` | Addition | `5 + 3` | `8` |
| `-` | Subtraction | `10 - 4` | `6` |
| `*` | Multiplication | `6 * 7` | `42` |
| `/` | Division | `20 / 4` | `5` |
| `^` | Exponent (power) | `2^3` | `8` (2×2×2) |
| `%%` | Remainder (modulo) | `17 %% 5` | `2` (17÷5 = 3 remainder 2) |

### Order of Operations

R follows standard math rules (PEMDAS/BODMAS):
1. **P**arentheses first: `(2 + 3) * 4` = `20`
2. **E**xponents: `2^3` = `8`
3. **M**ultiplication and **D**ivision (left to right)
4. **A**ddition and **S**ubtraction (left to right)

**💡 Tip:** When in doubt, use parentheses to make your intentions clear!
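To see these rules in action, here is a short sketch with made-up numbers (they are not part of the lesson's data). The same values give different answers depending on where the parentheses go:

```r
# Multiplication happens before addition
no_parens = 2 + 3 * 4        # 3 * 4 first, then + 2

# Parentheses force the addition to happen first
with_parens = (2 + 3) * 4

# Exponents happen before multiplication
power_first = 2 * 3^2        # 3^2 first, then * 2

# The remainder operator: what is left over after dividing 17 by 5
leftover = 17 %% 5

print(no_parens)     # 14
print(with_parens)   # 20
print(power_first)   # 18
print(leftover)      # 2
```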

Let's practice storing calculation results in variables:

```r
# Using our variables x=10 and y=5 from earlier

# Addition: add x and y, store the result
sum_xy = x + y

# Subtraction: subtract y from x
diff_xy = x - y

# Multiplication: multiply x and y
prod_xy = x * y

# Division: divide x by y
div_xy = x / y

# Let's also try some direct calculations
area_of_rectangle = 15 * 8          # width times height
average_score = (85 + 92 + 78) / 3  # add scores, divide by count
```

### Displaying Results with Context

Just printing a number like `15` isn't very helpful. What does 15 mean? 

The `paste()` function lets us combine text and numbers into meaningful messages:

```r
paste("The answer is", my_variable)
```

This creates output like: `"The answer is 42"` - much more informative!

```r
# Display our calculation results with helpful labels
print(paste("x =", x, "and y =", y))
print(paste("Sum (x + y):", sum_xy))
print(paste("Difference (x - y):", diff_xy))
print(paste("Product (x * y):", prod_xy))
print(paste("Division (x / y):", div_xy))

# Display our practical examples
print(paste("Area of rectangle:", area_of_rectangle, "square units"))
print(paste("Average test score:", average_score))
```
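By default, `paste()` puts a single space between the pieces it joins. As a small sketch (the variable names and numbers here are made up for illustration), the `sep` argument changes that separator, `paste0()` joins with no separator at all, and `round()` trims long decimals before display:

```r
item_price = 12.5
quantity = 3
total = item_price * quantity   # 37.5

# Default: pieces are joined with a single space
print(paste("Total cost:", total))

# sep = "" removes the space, so "$" sits right next to the number
print(paste("$", total, sep = ""))

# paste0() is a shortcut for paste(..., sep = "")
print(paste0("$", total))

# round() keeps the output readable when a result has many decimals
average = 10 / 3
print(paste("Average:", round(average, 2)))
```

`paste0()` is handy for things like currency symbols, units, or file names, where an extra space would look wrong.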

### ⚠️ Common Beginner Mistakes (and How to Fix Them!)

Don't worry - everyone makes these mistakes when starting out!

| Mistake | What You Wrote | Error You'll See | How to Fix |
|---------|----------------|------------------|------------|
| Forgetting quotes around text | `name = John` | `Error: object 'John' not found` | Use quotes: `name = "John"` |
| Typo in variable name | `print(naem)` | `Error: object 'naem' not found` | Check spelling: `print(name)` |
| Using a variable before creating it | `print(z)` (but z doesn't exist) | `Error: object 'z' not found` | Create the variable first: `z = 100` |
| Missing parenthesis | `print(x` | `Error: unexpected end of input` | Close the parenthesis: `print(x)` |
| Using wrong quotes | `name = "John'` | `Error: unexpected INCOMPLETE_STRING` | Match your quotes: `"John"` or `'John'` |

**💡 Tip:** Read error messages carefully - they often tell you exactly what's wrong!

---

## 📋 Data Types: What Kind of Information?

Before we go further, let's understand the types of data R can work with:

| Data Type | What It Is | Examples | R Name |
|-----------|------------|----------|--------|
| **Numbers** | Any numerical value | `42`, `3.14`, `-10` | `numeric` |
| **Text** | Words, sentences (in quotes) | `"Hello"`, `"John Smith"` | `character` |
| **True/False** | Logical values | `TRUE`, `FALSE` | `logical` |
| **Missing** | A placeholder for a value that isn't available | `NA` | `NA` (a special value, not a separate type) |

**💡 Key Point:** Text MUST be in quotes (`"like this"`), numbers should NOT be in quotes.

- `"42"` = text (you can't do math with it)
- `42` = number (you can do math with it)
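If you are unsure what type a value is, the built-in `class()` function reports it. Here is a brief sketch with illustrative values, including `as.numeric()` to turn number-like text into a real number:

```r
print(class(42))        # "numeric"
print(class("42"))      # "character"
print(class(TRUE))      # "logical"

# Text that looks like a number can be converted so math works
text_number = "42"
real_number = as.numeric(text_number)
print(real_number + 8)  # 50

# NA marks a missing value; is.na() tests for it
print(is.na(NA))        # TRUE
```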

---

## 📊 Vectors: Your First Data Structure

### What is a Vector?

So far, we've stored single values. But what if you have multiple values? That's where **vectors** come in!

A **vector** is like a shopping list - multiple items stored together under one name.

**Real-world analogy:**
| Shopping List | R Vector |
|--------------|----------|
| Eggs, Milk, Bread, Butter | `c("Eggs", "Milk", "Bread", "Butter")` |

### Creating Vectors with c()

We use the `c()` function to **combine** values into a vector. Think of "c" as "combine" or "collect".

```r
my_vector = c(value1, value2, value3, ...)
```

**Important:** A vector holds values of one type only (all numbers OR all text). If you mix types, R does not stop you; it silently converts every value to a common type.
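A minimal sketch of what happens when types are mixed inside `c()`, using made-up values:

```r
# Mixing numbers and text: R converts everything to text
mixed = c(1, "two", 3)
print(mixed)          # "1" "two" "3"
print(class(mixed))   # "character"

# Mixing numbers and logicals: TRUE becomes 1, FALSE becomes 0
mixed_logical = c(10, TRUE, FALSE)
print(mixed_logical)  # 10 1 0
```

Because the numbers in `mixed` are now text, math on them no longer works, which is why it pays to keep each vector to a single type.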

```r
# Creating vectors - collections of values

# A vector of numbers (test scores)
test_scores = c(85, 92, 78, 95, 88)

# A vector of text (student names)
student_names = c("Alice", "Bob", "Carol", "David", "Eva")

# A vector of prices
prices = c(10.99, 25.50, 8.75, 15.00, 32.99)

# A vector of logical values (did students pass?)
passed = c(TRUE, TRUE, FALSE, TRUE, TRUE)
```

### Viewing Your Vectors

Notice how R displays vectors - it shows all values and includes `[1]` at the start. This `[1]` indicates the position (index) of the first element shown.

```r
# Print our vectors
print("Test Scores:")
print(test_scores)

print("Student Names:")
print(student_names)

print("Prices:")
print(prices)
```

### 🔢 Useful Vector Functions

R has built-in functions to analyze vectors quickly:

| Function | What It Does | Example |
|----------|--------------|---------|
| `length()` | Count how many items | `length(test_scores)` → `5` |
| `sum()` | Add all numbers together | `sum(test_scores)` → `438` |
| `mean()` | Calculate the average | `mean(test_scores)` → `87.6` |
| `min()` | Find the smallest value | `min(test_scores)` → `78` |
| `max()` | Find the largest value | `max(test_scores)` → `95` |

```r
# Let's analyze our test scores vector
print(paste("Number of students:", length(test_scores)))
print(paste("Total of all scores:", sum(test_scores)))
print(paste("Average score:", mean(test_scores)))
print(paste("Lowest score:", min(test_scores)))
print(paste("Highest score:", max(test_scores)))
```
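Two more things are worth trying with vectors: picking out individual items by position with square brackets, and doing math on every item at once. This sketch redefines `test_scores` so it runs on its own; the 5-point bonus is a made-up example:

```r
test_scores = c(85, 92, 78, 95, 88)

# R counts positions starting from 1
first_score = test_scores[1]         # 85
first_three = test_scores[1:3]       # 85 92 78

# Arithmetic applies to every item in the vector
curved_scores = test_scores + 5      # 90 97 83 100 93

# Comparisons also work item by item and give TRUE/FALSE values
above_90 = test_scores > 90          # FALSE TRUE FALSE TRUE FALSE

print(first_score)
print(first_three)
print(curved_scores)
print(above_90)
```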

## Factors: For Categorical Data

**Factors** are R's way of handling categorical data - data that falls into specific categories or groups.

**When to use factors:**
- Colors (red, green, blue)
- Survey responses (Agree, Neutral, Disagree) 
- Categories (Small, Medium, Large)
- Gender (Male, Female, Other)

**Why use factors instead of regular text?**
- R knows these are categories, not just random text
- Can be more memory efficient, since values are stored as integer codes
- Enables proper statistical analysis
- Controls the order of categories (useful for charts and analysis)

**Note:** When you print a factor, R shows you the values AND the "Levels" (all possible categories).

```r
# Factors (for categorical data)
colors = factor(c("red", "green", "blue", "red"))
print(colors)
```
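To illustrate the point about controlling category order, here is a small sketch with made-up clothing sizes. It uses the `levels` argument of `factor()` plus two base R helpers, `levels()` and `table()`:

```r
sizes = c("Medium", "Small", "Large", "Small", "Medium", "Small")

# By default, levels are sorted alphabetically: Large, Medium, Small
default_factor = factor(sizes)
print(levels(default_factor))

# The levels argument sets a more meaningful order
ordered_sizes = factor(sizes, levels = c("Small", "Medium", "Large"))
print(levels(ordered_sizes))

# table() counts how many times each category appears
print(table(ordered_sizes))
```

Charts and summary tables follow the level order, so setting it once here saves rearranging things later.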

## Lists: Mixed Data Types

**Lists** are like containers that can hold different types of data together. Unlike vectors, lists are very flexible!

**What makes lists special:**
- Can contain different data types in the same structure
- Can hold numbers, text, vectors, even other lists!
- Elements can be named (like in our example: "Name", "Age", "Numbers")
- Very useful for organizing complex data

**Think of a list like a file folder:** 
- Each item has a label (the name)
- Each item can contain different types of content
- You can access items by their name or position

This makes lists perfect for storing related information that might be different types (like a person's name, age, and favorite numbers).

```r
# Lists (can contain different data types)
my_list = list("Name" = "Alice", "Age" = 30, "Numbers" = c(1, 2, 3))
print(my_list)
```
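Accessing items "by their name or position" can be sketched as follows. The list is redefined here so the example runs on its own:

```r
my_list = list("Name" = "Alice", "Age" = 30, "Numbers" = c(1, 2, 3))

# Access by name with $
print(my_list$Name)          # "Alice"

# Access by name with double square brackets
print(my_list[["Age"]])      # 30

# Access by position (the third item is the Numbers vector)
print(my_list[[3]])          # 1 2 3

# Items pulled out of a list behave like ordinary values
print(sum(my_list$Numbers))  # 6
```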

---

## 📊 Data Frames: The Star of Data Analysis

### What is a Data Frame?

A **data frame** is like a spreadsheet or Excel table - it has rows and columns!

| | Name | Age | Salary |
|---|------|-----|--------|
| Row 1 | Alice | 28 | 55000 |
| Row 2 | Bob | 35 | 62000 |
| Row 3 | Carol | 42 | 78000 |

**Key characteristics:**
- **Rows** = individual observations (each person, each transaction, each record)
- **Columns** = variables/attributes (name, age, salary)
- Each column can have a different data type (text, numbers, dates)
- All columns must have the same number of rows

**Why data frames are so important:**
- This is how real data looks! (CSV files, databases, Excel exports)
- Most R functions work with data frames
- Perfect for analysis, filtering, and visualization

### Creating a Data Frame

We use `data.frame()` and define each column:

```r
# Creating a data frame - like making a mini spreadsheet!
employees = data.frame(
  Name = c("Alice", "Bob", "Carol"),      # Column 1: text
  Age = c(28, 35, 42),                     # Column 2: numbers
  Salary = c(55000, 62000, 78000),         # Column 3: numbers
  Department = c("Sales", "IT", "HR")      # Column 4: text
)

# Display the data frame
print("Employee Data:")
print(employees)
```
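Once a data frame exists, you will usually want to pull pieces out of it. The sketch below rebuilds the `employees` table so it runs on its own, then uses a few base R tools: `nrow()`, `ncol()`, `$` for columns, and `[row, column]` indexing:

```r
employees = data.frame(
  Name = c("Alice", "Bob", "Carol"),
  Age = c(28, 35, 42),
  Salary = c(55000, 62000, 78000),
  Department = c("Sales", "IT", "HR")
)

# How big is the table?
print(nrow(employees))   # 3 rows
print(ncol(employees))   # 4 columns

# A single column, pulled out with $, is just a vector
print(employees$Salary)
print(mean(employees$Salary))   # 65000

# Square brackets select by [row, column]
print(employees[2, "Name"])     # Bob
```

Because each column is a vector, everything from the vector section (`sum()`, `mean()`, `max()`, and so on) works on data frame columns too.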

## Installing Packages (One-Time Setup)

**What are packages?** Think of packages as "apps" for R - they add new functions and capabilities.

**The `install.packages()` function:**
- Downloads and installs packages from the internet
- You only need to run this ONCE per computer
- That's why we've commented these out (they're already installed in this environment)

**Key packages we're using:**
- **`tidyverse`**: A collection of packages for data science (data cleaning, visualization, etc.)
- **`readxl`**: Specifically for reading Excel files (.xlsx, .xls)

**Important note:** If you're on your own computer and these packages aren't installed, you would uncomment these lines and run them once.

```r
# Installing Packages
# Packages extend R's functionality.
# The 'tidyverse' is a collection of packages for data science.

# Install if not already installed (run only once)
# install.packages("tidyverse")
# install.packages("readxl") # for reading Excel files
```
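One way to avoid reinstalling packages by accident is to check whether they are already present. The sketch below uses base R's `requireNamespace()`; the `is_installed()` helper and the `needed` and `missing_pkgs` names are made up for this example, and the actual install line is left commented out:

```r
# Hypothetical helper: TRUE if the package can be found, FALSE otherwise
is_installed = function(pkg) {
  requireNamespace(pkg, quietly = TRUE)
}

needed = c("tidyverse", "readxl")

# Check each package and keep the names of those that are not installed
missing_pkgs = needed[!sapply(needed, is_installed)]

if (length(missing_pkgs) == 0) {
  print("All packages are already installed.")
} else {
  print(paste("Not installed yet:", paste(missing_pkgs, collapse = ", ")))
  # Uncomment the next line to install only what is missing
  # install.packages(missing_pkgs)
}
```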



## Loading Packages with library()

Now we'll load the packages we need. Think of this like opening apps on your phone - the packages are installed, but we need to "open" them to use their functions.

**About our packages:**

**`tidyverse`** - A collection of packages that work well together for data science:
- Makes data cleaning and analysis much easier
- Includes tools for reading data, manipulating data, and creating visualizations
- Very popular in the R community

**`readxl`** - Specifically designed for reading Excel files:
- Can read both .xls and .xlsx files
- Handles Excel's complexities (multiple sheets, formatting, etc.)

**Important:** You only need to install packages once (`install.packages()`), but you need to load them with `library()` every time you start a new R session.

```r
# Load packages for use in the current session
library(tidyverse)
library(readxl)
```

## Understanding Working Directory

The **working directory** is like R's "current location" on your computer. It's the folder where R will:
- Look for files when you try to read them
- Save files when you create them

**Key functions:**
- `getwd()`: Shows you where R is currently "looking"
- `setwd()`: Changes R's current location (use carefully!)

**Why this matters:**
- If your data file is in a different folder, R won't find it
- Understanding file paths prevents many common errors
- Good practice: organize your projects in folders

**Best practice:** Instead of changing the working directory, keep each project in its own folder and refer to files with paths relative to that folder.

**Note about commented code:** The `setwd()` line is commented out because:
1. Everyone's computer has a different folder structure, so a hard-coded path will fail on other machines
2. Relative paths inside a well-organized project folder work for everyone
3. Changing the working directory partway through can cause confusion about where files are read and saved

```r
# Setting up Working Directory
# The working directory is where R will look for files and save files.

# Get current working directory
getwd()

# Set working directory (uncomment and replace with your own path)
# setwd("/workspaces/Data-Management-Assignment-1-Intro-to-R/data")

# Verify the working directory
getwd()
```
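As a sketch of the relative-path approach, `file.path()` builds a path from folder and file names, and `file.exists()` checks it before you try to read anything. The `data` folder and `sales.csv` file are placeholders, not files that come with this lesson:

```r
# Build a path relative to the working directory
data_file = file.path("data", "sales.csv")
print(data_file)   # "data/sales.csv"

# Check whether the file is really there before reading it
if (file.exists(data_file)) {
  print("File found - ready to read.")
} else {
  print(paste("File not found. R is currently looking in:", getwd()))
}
```

Checking first turns a confusing read error into a clear message about where R is looking.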

## Creating Sample Data and Reading CSV Files

For this lesson, we'll create a small sample CSV file to practice with. In real projects, you'd usually work with existing data files.

**What we're doing:**
1. Creating sample data as text (with headers and comma-separated values)
2. Writing it to a CSV file using `writeLines()`
3. Reading it back using `read_csv()` from the tidyverse

**About CSV files:**
- **CSV** = Comma-Separated Values
- Very common format for data exchange
- Can be opened in Excel, R, Python, databases, etc.
- Simple text format that's human-readable

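**`read_csv()` vs base R's `read.csv()`:**
- `read_csv()` (from the `readr` package, part of the tidyverse) is usually faster on large files
- It reports the data type it guessed for each column, so surprises are easier to spot
- It creates tibbles (enhanced data frames)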

```r
# Importing Data from CSV and Excel files

# Create a dummy CSV file for demonstration
dummy_csv_data = "ID,Product,Sales
1,Laptop,1200
2,Mouse,25
3,Keyboard,75
4,Monitor,300"
writeLines(dummy_csv_data, "dummy_sales_data.csv")
```

```r
# Import CSV file using readr (part of tidyverse)
sales_data_csv = read_csv("dummy_sales_data.csv")
print(sales_data_csv)
```
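For comparison, here is a sketch of the same round trip using only base R, so it runs even when the tidyverse is not loaded. It writes to a temporary file created with `tempfile()` (a choice made for this example, so your folder stays untouched) and reads it back with `read.csv()`:

```r
csv_text = "ID,Product,Sales
1,Laptop,1200
2,Mouse,25
3,Keyboard,75
4,Monitor,300"

# tempfile() gives a throwaway file path
temp_csv = tempfile(fileext = ".csv")
writeLines(csv_text, temp_csv)

# read.csv() is base R's CSV reader; it returns a regular data frame
sales_base = read.csv(temp_csv)
print(sales_base)

# Numeric columns are ready for calculations straight away
print(sum(sales_base$Sales))   # 1600

# Remove the temporary file when done
file.remove(temp_csv)
```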


## Working with Excel Files (Commented Code Explanation)

**Excel files (.xlsx, .xls)** are very common in business and research. The `readxl` package makes it easy to read them into R.

**Why the code is commented out:**
The next few cells show examples of working with Excel files, but they're commented out because:
1. We'd need to install additional packages (`openxlsx`) to create Excel files
2. For this demo, we're focusing on CSV files which are simpler
3. In real projects, you'd typically have Excel files provided to you

**Key functions for Excel:**
- `read_excel("filename.xlsx")`: Reads Excel files
- `write.xlsx()`: Creates Excel files (requires `openxlsx` package)
- Excel files can have multiple sheets, which `readxl` can handle

**When you'd use this:** In real projects when you receive data in Excel format from colleagues or clients.

```r
# Create a dummy Excel file (requires a package such as 'openxlsx'; 'readxl' can only read Excel files)
# For simplicity, we'll assume an Excel file exists or is created manually for the lesson.
# If you need to create one programmatically for testing, you'd use a package like 'openxlsx'.
# Example of creating a dummy Excel file (requires 'openxlsx' package)
# install.packages("openxlsx")
# library(openxlsx)
# dummy_excel_data <- data.frame(
#   OrderID = c(101, 102, 103),
#   Customer = c("Alice", "Bob", "Charlie"),
#   Amount = c(150.50, 200.00, 75.25)
# )
# write.xlsx(dummy_excel_data, "dummy_orders_data.xlsx")
```

```r
# For the purpose of this example, let's simulate reading an Excel file
# Assume 'dummy_orders_data.xlsx' exists in the working directory
# orders_data_excel <- read_excel("dummy_orders_data.xlsx")
# print(orders_data_excel)
```


## Data Inspection: Your First Look at Data

Once you've loaded data, the first thing you should do is **inspect** it. This helps you understand:
- What variables (columns) you have
- How much data you're working with  
- What the data looks like
- If there are any obvious problems

**Key inspection functions we'll use:**
- `head()`: Shows first few rows
- `str()`: Shows structure and data types  
- `summary()`: Shows statistical summaries
- `View()`: Opens data in spreadsheet viewer (RStudio)

Let's explore our sales data step by step:

```r
# Basic Data Inspection

# View the first few rows
head(sales_data_csv)
```


## The head() Function: Quick Data Preview

`head()` shows you the first few rows of your data (default is 6 rows). This is super useful because:

- **Quick overview**: See what your data looks like without overwhelming output
- **Column names**: Check if column names imported correctly  
- **Data types**: Get a sense of what type of data is in each column
- **Sample values**: See examples of actual data values

**Tip:** You can specify how many rows to show: `head(data, n = 10)` shows first 10 rows.

This is often the first thing data analysts do when they load new data!
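Here is a short sketch of the `n` argument in practice, along with a few companion functions from base R (`tail()`, `dim()`, `names()`) that this lesson does not otherwise cover. The `sales` data frame is rebuilt by hand so the example runs on its own:

```r
sales = data.frame(
  ID = 1:4,
  Product = c("Laptop", "Mouse", "Keyboard", "Monitor"),
  Sales = c(1200, 25, 75, 300)
)

# First two rows only
print(head(sales, n = 2))

# tail() is the mirror image: the last rows
print(tail(sales, n = 2))

# dim() gives the number of rows, then the number of columns
print(dim(sales))      # 4 3

# names() lists the column names
print(names(sales))
```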

```r
# Get structure of the data frame
str(sales_data_csv)
```

## The str() Function: Understanding Data Structure

`str()` (structure) gives you a comprehensive overview of your data's "anatomy":

**What str() tells you:**
- **Data type**: Is it a data frame, list, vector, etc.?
- **Dimensions**: How many rows and columns (observations and variables)?
- **Column types**: Are columns numeric, character, factor, etc.?
- **Sample values**: A few example values from each column

**Why this matters:**
- **Data types affect analysis**: You can't do math on text columns
- **Data size**: The row and column counts show how large the dataset is
- **Data problems**: Quickly spot if numbers were read as text, etc.

**Example output interpretation:**
- `num`: Numeric (numbers you can do math with)
- `chr`: Character (text)
- `Factor`: Categorical data
- `logi`: Logical (TRUE/FALSE)
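A common problem that `str()` helps catch is a number column that was read in as text. The sketch below builds a small made-up `orders` table with that problem on purpose, then fixes it with `as.numeric()`:

```r
# Sales was typed with quotes, so R stores it as text
orders = data.frame(
  Product = c("Laptop", "Mouse", "Keyboard"),
  Sales = c("1200", "25", "75"),
  stringsAsFactors = FALSE
)

# str() reveals the problem: Sales is listed as chr, not num
str(orders)

# Convert the column to numbers, then check again
orders$Sales = as.numeric(orders$Sales)
str(orders)

print(sum(orders$Sales))   # 1300
```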

```r
# Get summary statistics
summary(sales_data_csv)
```

## The summary() Function: Statistical Overview

`summary()` provides statistical summaries for each column in your data:

**For numeric columns, you get:**
- **Min**: Smallest value
- **1st Qu**: First quartile (25th percentile)  
- **Median**: Middle value (50th percentile)
- **Mean**: Average value
- **3rd Qu**: Third quartile (75th percentile)
- **Max**: Largest value

**For character columns:**
- **Length**: How many values
- **Class**: What type of data
- **Mode**: Storage mode

**For factor columns:**
- A count of how many times each category appears

**Why this is valuable:**
- **Spot outliers**: Extreme min/max values that might be errors
- **Understand distribution**: Is data skewed? Are there unusual patterns?
- **Missing data**: Shows if there are NA (missing) values
- **Data quality**: Quick check for reasonable ranges

**Example interpretation:** If you're looking at sales data and the minimum is negative, that might indicate a data entry error!
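To make that concrete, here is a sketch using made-up sales figures that include one negative entry and one missing value. It shows how `summary()` exposes both, and how `is.na()` and the `na.rm` argument deal with missing data:

```r
# Made-up sales figures with one negative entry and one missing value
sales = c(1200, 25, -75, NA, 300)

# summary() reports the minimum and counts the NA's
print(summary(sales))

# Count missing values directly
print(sum(is.na(sales)))           # 1

# Most functions return NA if any value is missing...
print(mean(sales))                 # NA

# ...unless you tell them to skip missing values
print(mean(sales, na.rm = TRUE))   # 362.5
```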

```r
# View the entire dataset in a spreadsheet-like viewer (RStudio specific)
# View(sales_data_csv)
```

## The View() Function (Interactive Data Exploration)

**View() function:**
- Opens your data in a spreadsheet-like viewer (only works in RStudio)
- Great for exploring data interactively
- You can sort columns, filter, and browse through all your data
- **Note:** We've commented this out because it only works in interactive RStudio sessions, not in notebooks

**When to use View():**
- When you want to see your entire dataset
- To check for data quality issues visually
- To get familiar with large datasets
- To verify that data import worked correctly

```r
# Clean up dummy file
file.remove("dummy_sales_data.csv")
```

## File Cleanup: Good Housekeeping

**File cleanup with file.remove():**
Since we created a temporary demo file for this lesson, we'll clean it up at the end. This is good practice:
- Keeps your workspace tidy
- Prevents confusion between demo files and real data
- Shows how to programmatically manage files in R

**In real projects:** You usually won't delete your data files - they're valuable! But removing temporary or intermediate files helps keep projects organized.

**The `file.remove()` function:**
- Deletes files from your computer
- Returns `TRUE` if the file was deleted, or `FALSE` (with a warning) if it could not be, for example because the file doesn't exist
- Use with caution - deleted files can't be easily recovered!
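A cautious pattern is to check that a file exists before removing it. This sketch creates its own throwaway file with `tempfile()` so nothing of yours is touched; the variable names are made up for the example:

```r
# Create a throwaway file so there is something to delete
temp_file = tempfile(fileext = ".csv")
writeLines("ID,Product", temp_file)
print(file.exists(temp_file))   # TRUE

# Only delete the file if it is actually there
if (file.exists(temp_file)) {
  removed = file.remove(temp_file)
  print(removed)                # TRUE
}

print(file.exists(temp_file))   # FALSE
```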

---

## 🎓 Lesson 1 Summary: What You've Learned

Congratulations! You've completed your first R lesson. Here's what you now know:

### 🤔 **Big Picture Concepts**
- What programming is (giving step-by-step instructions to a computer)
- Why R is great for data analysis
- How to use Jupyter notebooks (read, run code, see results)
- What comments are (`#`) and why they're useful

### 📦 **Variables**
- Variables store information for later use
- Create them with `=`: `my_variable = value`
- Text needs quotes: `name = "Alice"`
- Numbers don't: `age = 25`
- Use descriptive names!

### 🧮 **Arithmetic**
- Basic math: `+`, `-`, `*`, `/`, `^`
- Store results in variables: `result = 10 + 5`
- Display results with `print()` and `paste()`

### 📊 **Data Structures**
| Structure | What It Is | Example |
|-----------|------------|---------|
| **Vector** | Sequence of same-type values | `c(1, 2, 3, 4, 5)` |
| **Factor** | Categorical data | `factor(c("red", "green"))` |
| **List** | Mixed data types | `list(name="Alice", age=25)` |
| **Data Frame** | Table/spreadsheet | `data.frame(Name=..., Age=...)` |

### 🔢 **Useful Functions**
| Function | Purpose |
|----------|---------|
| `print()` | Display a value |
| `paste()` | Combine text and values |
| `c()` | Create a vector |
| `length()`, `sum()`, `mean()` | Analyze vectors |
| `head()`, `str()`, `summary()` | Inspect data frames |

### 📦 **Packages & Files**
- Load packages with `library(tidyverse)`
- Check location with `getwd()`
- Read CSV files with `read_csv("filename.csv")`

### ⚠️ **Common Errors to Watch For**
- Forgetting quotes around text
- Typos in variable names
- Missing parentheses
- Using variables before creating them

---

## 🚀 What's Next?

In the next lessons, you'll learn to:
- Filter and transform data
- Create visualizations (charts and graphs)
- Perform statistical analysis
- Work with real-world datasets

**Keep practicing!** The more you code, the more natural it becomes. Don't be afraid to make mistakes - that's how everyone learns! 💪