<h2>Working with Data in R</h2>

<h3>Getting started: Installing packages</h3>

In [None]:
install.packages(c("readxl", "jsonlite"))

library(readxl)
library(jsonlite)

<h3>Reading .csv files</h3>

In [10]:
finance_data <- read.csv("_f733a70b3d11467ca6c77f726ac01bd2_C1_M3_L1_Finance_insurance_data.csv")

### Arguments of the `read.csv()` function

* header: TRUE if the file's first row contains column labels (like "Name" or "Age"); FALSE if it's only data.

* sep: Usually commas, but using tabs ("\t") or semicolons (";") might be necessary.

* stringsAsFactors: Choose FALSE (the default since R 4.0.0) to keep R from automatically converting text into categorical data ("factors"), giving you more control.
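As a minimal sketch of these three arguments, the cell below parses a small semicolon-separated table supplied inline through `text =` instead of a file path. The column names and values are made up for illustration.

```r
# Hypothetical semicolon-separated data, passed inline instead of a file path
csv_text <- "Name;Age;City
Ana;34;Lisbon
Bruno;28;Porto"

people <- read.csv(
  text = csv_text,
  header = TRUE,             # the first row holds the column labels
  sep = ";",                 # fields are separated by semicolons
  stringsAsFactors = FALSE   # keep text columns as character vectors
)

str(people)
```

Running `str()` right after an import is a quick way to confirm that the separator and header settings produced the columns you expected.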

<h3>Reading .xls or .xlsx files</h3>

In [11]:
library(readxl)

# The file path must be a quoted string; a bare _ is parsed as the pipe placeholder and fails.
# "_" stands in for the path to your .xls or .xlsx file.
bioinformatics_dataset <- read_excel("_")


### Arguments of the `read_excel()` function

* sheet: Specify which worksheet you want (default is automatically the first one). Example: sheet = "Sales_2023"

* range: Choose a range of cells to import. For example, range = "A1:D20" imports just those specific cells.

* Expert tip: Many Excel files contain multiple sheets. If you often import several sheets, consider writing your own function or loop to automate that repetitive task.
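One possible way to automate a multi-sheet import is sketched below. It assumes the sample workbook `datasets.xlsx` that ships with readxl (located with `readxl_example()`); the variable names `sheet_names` and `all_sheets` are illustrative.

```r
library(readxl)

# readxl_example() returns the path to a sample workbook bundled with readxl
path <- readxl_example("datasets.xlsx")

# List the worksheet names, then read every sheet into a named list of data frames
sheet_names <- excel_sheets(path)
all_sheets <- lapply(sheet_names, function(s) read_excel(path, sheet = s))
names(all_sheets) <- sheet_names

names(all_sheets)
```

Each sheet is then available by name, for example `all_sheets[["mtcars"]]`.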

### Reading .json files

In [None]:
library(jsonlite)

# The file name must be passed as a quoted string
healthcare_dataset <- fromJSON("_8c3152238fa54a45b94dff54a9ea16ae_C1_M3_L1_healthcare_bioinformation.json")

### Arguments of the `fromJSON()` function

* simplifyDataFrame: Set TRUE (the default) to immediately format JSON data into an easy-to-use data frame.

* flatten: Use TRUE to simplify nested JSON structures, making it much easier to work directly with your data in R.

* Expert tip: Because JSON structure varies with the data it holds, always run a quick test import first. Afterwards, use str(healthcare_dataset) in R to see the structure and check whether additional reshaping or simplifying is needed.
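To illustrate what `flatten` changes, here is a small sketch that parses an inline JSON string rather than a file. The records and field names (`vitals`, `hr`, `bp`) are hypothetical.

```r
library(jsonlite)

# Hypothetical nested JSON: each record carries a nested "vitals" object
json_text <- '[
  {"id": 1, "name": "Ana",  "vitals": {"hr": 72, "bp": "120/80"}},
  {"id": 2, "name": "Luis", "vitals": {"hr": 65, "bp": "110/70"}}
]'

nested <- fromJSON(json_text)                 # "vitals" becomes a data frame stored inside a column
flat   <- fromJSON(json_text, flatten = TRUE) # nested fields become ordinary columns

str(nested)
names(flat)
```

With `flatten = TRUE` the nested fields appear as regular columns named `vitals.hr` and `vitals.bp`, which is usually easier to filter and summarize.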

### Using `tryCatch()` to detect and handle errors

In [None]:
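# Example of using tryCatch() for error handling.
# For a missing file, read.csv() signals a warning before the error, so the warning handler is the one that runs here.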
tryCatch({
    data <- read.csv("data/non_existent_file.csv")
}, warning = function(w) {
    message("Warning: ", conditionMessage(w))
}, error = function(e) {
    message("Error: ", conditionMessage(e))
    # Additional error-handling code
}, finally = {
    message("Attempted the data import.")
})
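The same pattern can be wrapped in a reusable function. The sketch below defines a hypothetical helper, `safe_read_csv()`, that returns `NULL` instead of stopping the script when the import fails. Treating any warning as a failure is a design choice made for this example, not a requirement of `tryCatch()`.

```r
# Hypothetical helper: returns a data frame on success, NULL on a warning or error
safe_read_csv <- function(path) {
  tryCatch(
    read.csv(path),
    warning = function(w) {
      message("Could not read '", path, "': ", conditionMessage(w))
      NULL
    },
    error = function(e) {
      message("Could not read '", path, "': ", conditionMessage(e))
      NULL
    }
  )
}

result <- safe_read_csv("data/non_existent_file.csv")
is.null(result)
```

Callers can then test `is.null(result)` and decide whether to skip the file, retry, or stop.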

### Exporting data to .csv, .json, .xls/.xlsx, and .pdf with R

* ####  .csv

In [None]:
# Example: Export data to CSV
write.csv(data, file = "output.csv", row.names = FALSE)

* #### .json


In [None]:
library(jsonlite)  # Example: Export data to JSON
write_json(data, "output.json")

* #### .xlsx

In [None]:
library(openxlsx)  # Example: Export data to Excel
write.xlsx(data, "output.xlsx")

* #### .pdf

In [None]:
library(rmarkdown)  # Example: Export document to PDF (requires a LaTeX installation such as TinyTeX)
render("report.Rmd", output_format = "pdf_document")

### Data Preservation

* Check Data Types: Ensure data types are interpreted correctly. Dates should remain dates, not be converted to strings.

In [None]:
# Example: Ensure date columns remain dates
data$date_column <- as.Date(data$date_column)

* Handle Missing Values: Decide how to represent missing values. For example, in CSV files, you might use an empty string or a specific placeholder like "NA".

In [None]:
# Example: Specify missing values in CSV
write.csv(data, "output.csv", na = "")

* Maintain Column Order: Ensure the order of columns in your dataset is preserved in the exported file, so your data remains structured and interpretable.
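A quick way to check all three points is a round trip: export a small data frame, read it back, and compare. The sketch below uses a hypothetical data frame and a temporary file, and it assumes empty fields should be read back as missing values.

```r
# Hypothetical data with a date column and some missing values
df <- data.frame(
  id = 1:3,
  date_column = as.Date(c("2024-01-15", NA, "2024-03-01")),
  score = c(9.5, NA, 7)
)

path <- tempfile(fileext = ".csv")
write.csv(df, path, row.names = FALSE, na = "")   # write missing values as empty fields

# Read the file back, treating empty fields as NA, then restore the date type
back <- read.csv(path, na.strings = "")
back$date_column <- as.Date(back$date_column)

identical(names(back), names(df))                 # column order preserved?
all.equal(back$date_column, df$date_column)       # dates preserved?
all.equal(back$score, df$score)                   # missing values preserved?
```

CSV files do not store column types, so the date column comes back as text and has to be converted again after import.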

### Encoding Issues

*   Use UTF-8 Encoding: UTF-8 is a broadly accepted encoding standard that supports a wide range of characters. It is the recommended default because it helps prevent issues with text data losing special characters or being misinterpreted during export. This aligns with your data import practices, where you might have ensured that special characters were handled correctly.

In [None]:
# Example: Export CSV with UTF-8 encoding
write.csv(data, file = "output.csv", fileEncoding = "UTF-8")

* Specify Encoding in Export Functions: Many R functions allow you to specify encoding. Always setting it explicitly helps avoid unexpected surprises later when others use the data.

*  Test with Sample Data: Before exporting a large dataset, especially one with complicated encoding requirements, export a small sample to verify everything is working correctly. This ties into your broader practice of inspecting small data samples before processing the entire dataset. Additionally, verifying data integrity by re-importing the exported data can ensure no critical information was lost or altered.

*  Real-world example: Imagine you are exporting customer names who come from diverse international backgrounds. You can create a small sample with various characters, export it, and check the output to ensure that all characters are preserved accurately.
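The sample-export check described above might look like the following sketch. The names are hypothetical, non-ASCII characters are written as `\u` escapes so the script itself stays ASCII, and the example assumes R is running in a UTF-8 locale.

```r
# Hypothetical sample of names with accented and non-Latin-1 characters
sample_names <- data.frame(
  name = c("Jos\u00e9", "Zo\u00eb", "S\u00f8ren", "\u0141ukasz"),
  stringsAsFactors = FALSE
)

sample_path <- tempfile(fileext = ".csv")
write.csv(sample_names, sample_path, row.names = FALSE, fileEncoding = "UTF-8")

# Re-import with the same encoding and compare against the original
reimported <- read.csv(sample_path, fileEncoding = "UTF-8", stringsAsFactors = FALSE)
identical(reimported$name, sample_names$name)
```

If the comparison returns `FALSE`, inspect the sample before exporting the full dataset.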

### Data Manipulation Fundamentals

#### Install packages and load the dplyr library

In [None]:
install.packages("dplyr")
library(dplyr)

#### Selection Methods

In [None]:
library(dplyr)

# Select the columns named Title and Salary from the employee_data dataset
selected_data <- select(employee_data, Title, Salary)

# Same selection written with the pipe operator %>%
selected_data <- employee_data %>% select(Title, Salary)

# Display the selected columns
print(selected_data)

Often, you won't immediately know your column names. A helpful way to quickly choose columns is by using helper functions. For example:

In [None]:
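# The helpers only work inside a selecting function such as select(); "dataset" is a placeholder data frame
select(dataset, starts_with("price"))   # all columns whose names begin with "price"
select(dataset, ends_with("_score"))    # all columns whose names end with "_score"
select(dataset, contains("amount"))     # all columns whose names contain "amount"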
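For a runnable illustration of the helpers, the sketch below builds a hypothetical `products` data frame whose column names were chosen to match each pattern.

```r
library(dplyr)

# Hypothetical data frame; the column names are chosen to match the helpers
products <- data.frame(
  price_usd = c(10, 20),
  price_eur = c(9, 18),
  quality_score = c(4.5, 3.8),
  tax_amount_usd = c(1, 2)
)

names(select(products, starts_with("price")))   # price_usd, price_eur
names(select(products, ends_with("_score")))    # quality_score
names(select(products, contains("amount")))     # tax_amount_usd
```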

#### Filtering Techniques

In [None]:
# Select only the rows of "employee_data" where Salary is greater than 50000
filtered_data <- filter(employee_data, Salary > 50000)

# Filter using %>%
filtered_data <- employee_data %>% filter(Salary > 50000)

# Display the filtered rows
print(filtered_data)

* & (and): Both conditions must be true (filter(dataset, Salary > 50000 & Years_at_Company > 5)).

* | (or): Just one condition needs to be met (filter(dataset, Department == "Sales" | Department == "Marketing")).

* ! (not): Exclude rows matching your condition (filter(dataset, !(Department == "Finance")), which is equivalent to filter(dataset, Department != "Finance")).
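The three operators can be tried on a small table. The `employee_data` values below are hypothetical and exist only to make the example runnable.

```r
library(dplyr)

# Hypothetical employee data for illustration
employee_data <- data.frame(
  Name = c("Ana", "Bruno", "Carla", "Diego"),
  Department = c("Sales", "Finance", "Marketing", "Sales"),
  Salary = c(62000, 48000, 55000, 45000),
  Years_at_Company = c(7, 3, 6, 2)
)

# AND: both conditions must hold (Ana and Carla)
senior_high_earners <- filter(employee_data, Salary > 50000 & Years_at_Company > 5)

# OR: either condition may hold (Ana, Carla, and Diego)
sales_or_marketing <- filter(employee_data, Department == "Sales" | Department == "Marketing")

# NOT: drop rows that match the condition (everyone except Bruno)
not_finance <- filter(employee_data, Department != "Finance")

print(senior_high_earners)
```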

### Check whether any value in a column meets a condition:

In [None]:
any(employee_data$Salary > 100000)

### Transforming data

In [None]:
# Create a new column "kpg" based on existing column "mpg"
transformed_data <- mutate(mtcars, kpg = mpg * 1.6)

# Mutate using %>%
transformed_data <- mtcars %>% mutate(kpg = mpg * 1.6)

# Display updated data
print(transformed_data)

In [None]:
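# Label each employee by salary band; conditions are checked in order and the first match wins
employee_data %>% mutate(salary_group = case_when(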
  Salary < 40000 ~ "Low",
  Salary < 60000 ~ "Medium",
  TRUE ~ "High"
))

In [None]:
employee_data %>% mutate(experienced = if_else(Years_at_Company >= 5, "Yes", "No"))
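Both conditional helpers can be combined in one `mutate()` call. The sketch below uses a hypothetical three-row `employee_data` so the resulting labels are easy to check by eye.

```r
library(dplyr)

# Hypothetical employee data for illustration
employee_data <- data.frame(
  Name = c("Ana", "Bruno", "Carla"),
  Salary = c(35000, 52000, 75000),
  Years_at_Company = c(2, 5, 9)
)

labeled <- employee_data %>%
  mutate(
    # case_when() checks conditions top to bottom; TRUE acts as the catch-all
    salary_group = case_when(
      Salary < 40000 ~ "Low",
      Salary < 60000 ~ "Medium",
      TRUE ~ "High"
    ),
    # if_else() handles a simple two-way split
    experienced = if_else(Years_at_Company >= 5, "Yes", "No")
  )

print(labeled)
```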

### Common patterns in data manipulation

* #### Chaining Operations Clearly:

In [None]:
# Selecting columns, filtering rows, and then adding a new column—all chained into one easy-to-understand pipeline
result <- mtcars %>%
    select(mpg, cyl) %>%
    filter(mpg > 20) %>%
    mutate(kpg = mpg * 1.6)

print(result)

* #### Handling Missing Data (Empty Spots in Your Dataset)

In [None]:
# Removes rows with missing values (NA) from the dataset
clean_data <- na.omit(your_dataset)

Want a quick way to count missing values across columns? Use colSums(is.na(your_dataset)).

This command is a quick diagnostic tool. It checks for missing values (NA) in every column and totals them up using colSums(). It’s extremely useful when cleaning your data—you’ll instantly see which parts of your dataset might need attention before running analyses.
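As a small worked example, the sketch below counts and then removes missing values in a hypothetical `survey` data frame.

```r
# Hypothetical survey data with missing values in every column
survey <- data.frame(
  age = c(25, NA, 41, 37),
  income = c(52000, 61000, NA, NA),
  city = c("Lisbon", "Porto", NA, "Faro")
)

# Count missing values per column: age 1, income 2, city 1
colSums(is.na(survey))

# Keep only the rows with no missing values (only the first row qualifies)
clean_survey <- na.omit(survey)
nrow(clean_survey)
```

Note how much data `na.omit()` can discard: three of the four rows are dropped here, so it is worth checking the counts first.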

#### `subset()` — A Shortcut for Filtering and Selecting

In [None]:
subset(product_data, Category == "Electronics", select = c(Name, Price))

This line says:

* Filter rows where Category is "Electronics"

* Return only the Name and Price columns

It's concise and readable, but also a little rigid.

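`subset()`:

* Is great for quick, one-off filtering when exploring data interactively

* Can behave unpredictably in more complex scripts (especially inside custom functions), because of the way it looks up column names

* Is part of base R rather than the tidyverse, so it does not share dplyr's helpers and conventions, although it takes the data frame as its first argument and can still be used after a pipe (%>%)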
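For comparison, here is the same operation written with `subset()` and with dplyr verbs. The `product_data` rows are hypothetical.

```r
library(dplyr)

# Hypothetical product data for illustration
product_data <- data.frame(
  Name = c("Laptop", "Desk", "Phone"),
  Category = c("Electronics", "Furniture", "Electronics"),
  Price = c(1200, 300, 800)
)

# Base R: filter rows and select columns in one call
base_result <- subset(product_data, Category == "Electronics", select = c(Name, Price))

# dplyr: the same steps as an explicit pipeline
dplyr_result <- product_data %>%
  filter(Category == "Electronics") %>%
  select(Name, Price)

print(base_result)
print(dplyr_result)
```

Both results contain the same rows and columns; only the row names differ, because `subset()` keeps the original row numbers.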

#### merge() — Joining Two Data Frames by a Shared Key

Now, say you're analyzing employee performance. You receive two separate files:

* employee_info — with columns EmployeeID, Name, and Department

* performance_scores — with columns EmployeeID and Score

To combine these into one dataset that includes names and scores side by side, you’d use merge():

In [None]:
merged_data <- merge(employee_info, performance_scores, by = "EmployeeID")
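The sketch below makes the join runnable with two small hypothetical tables, and also shows the `all.x` argument of `merge()`, which keeps employees that have no matching score.

```r
# Hypothetical tables; employee 3 has no score and score 4 has no employee
employee_info <- data.frame(
  EmployeeID = c(1, 2, 3),
  Name = c("Ana", "Bruno", "Carla"),
  Department = c("Sales", "Finance", "Marketing")
)
performance_scores <- data.frame(
  EmployeeID = c(1, 2, 4),
  Score = c(88, 92, 75)
)

# Default behavior: keep only IDs present in both tables (employees 1 and 2)
merged_data <- merge(employee_info, performance_scores, by = "EmployeeID")

# all.x = TRUE keeps every row of the first table; missing scores become NA
merged_all <- merge(employee_info, performance_scores, by = "EmployeeID", all.x = TRUE)

print(merged_data)
print(merged_all)
```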

### Mean, Median, and Mode

* Mean:

In [None]:
vacation_days <- c(10, 12, 8, 9, 40)  # One employee took 40 days off
mean(vacation_days)  # Returns 15.8

* Median:

In [None]:
median(vacation_days)  # Returns 10

* Mode:

R doesn’t include a built-in mode function, but you can write one.

In [None]:
vacation_days <- c(10, 12, 8, 9, 10, 10)

get_mode <- function(x) {
  uniqx <- unique(x)  # Finds all unique values in the dataset
  uniqx[which.max(tabulate(match(x, uniqx)))]  # Finds which value appears most often
}
get_mode(vacation_days)  # Returns 10

### Variance and Standard Deviation

* Variance:

In [None]:
variance <- var(dataset)  # same as sd(dataset)^2; "dataset" is a numeric vector

* Standard Deviation:

In [None]:
sd(dataset)
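To connect the two measures, the sketch below reuses the `vacation_days` values from the mean example and computes the sample variance by hand as a cross-check.

```r
vacation_days <- c(10, 12, 8, 9, 40)

var(vacation_days)   # sample variance: 185.2
sd(vacation_days)    # standard deviation: the square root of the variance, about 13.61

# Manual calculation: sum of squared deviations from the mean, divided by n - 1
manual_variance <- sum((vacation_days - mean(vacation_days))^2) / (length(vacation_days) - 1)
all.equal(manual_variance, var(vacation_days))
```

Both `var()` and `sd()` divide by n - 1 rather than n, which is the usual convention for sample data.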

### `group_by()` and `summarize()`

In [12]:
library(dplyr)

# Sample data: Recovery days for different treatments
recovery_data <- data.frame(
  treatment = c("A", "A", "B", "B", "B", "C"),
  days = c(10, 12, 9, 15, 14, 8)
)

# Group by treatment type, then summarize
recovery_data %>%
  group_by(treatment) %>%
  summarize(avg_days = mean(days))

| treatment (`<chr>`) | avg_days (`<dbl>`) |
|---|---|
| A | 11.0 |
| B | 12.66667 |
| C | 8.0 |


### Correlation and Covariance

`cor()` returns a value from -1 to +1

* A value near 0 means no linear relationship.

* A value near +1 means a strong positive relationship.

* A value near -1 means a strong negative relationship.

Example: You work for a fitness tech company, analyzing user data to see whether more daily steps are associated with better sleep quality. You would use the following call to check for a correlation between steps and sleep quality.

In [None]:
cor(steps, sleep_quality)

Covariance

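This is calculated with `cov()` and measures the direction of the relationship: positive when the variables move together, negative when they move in opposite directions. Its magnitude depends on the units of the variables, so it does not indicate strength on its own and is hard to compare across datasets.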

In [None]:
cov(income, spending)
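The difference between the two measures shows up when a variable is rescaled. The sketch below uses hypothetical step counts and sleep-quality ratings, then expresses the steps in thousands.

```r
# Hypothetical observations for five users
steps <- c(4000, 6000, 8000, 10000, 12000)
sleep_quality <- c(5.5, 6.0, 6.8, 7.1, 8.0)

cor(steps, sleep_quality)   # close to +1: strong positive linear relationship
cov(steps, sleep_quality)   # positive, but the size depends on the units

# Express steps in thousands: the correlation is unchanged, the covariance shrinks by 1000
steps_k <- steps / 1000
cor(steps_k, sleep_quality)
cov(steps_k, sleep_quality)
```

This is why correlation is usually preferred for comparing relationships across datasets measured in different units.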

### Modeling Relationships: Linear Regression

Linear Regression (lm)

The `lm()` function in R lets you build a model that predicts a numeric outcome based on one or more inputs.

Example: You're analyzing a dataset of employees, stored in salary_data.csv, and want to predict an employee's salary from the number of years they have worked. You would type the following:

In [None]:
salary_data <- read.csv("salary_data.csv")
salary_model <- lm(Salary ~ Experience, data = salary_data)
summary(salary_model)
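If salary_data.csv is not at hand, the same workflow can be tried on made-up data. The sketch below builds a hypothetical `salary_data` in which salary rises by roughly 2,500 per year of experience, fits the model, and predicts the salary for a new value.

```r
# Hypothetical data: base salary of 30000 plus about 2500 per year, with small fixed deviations
salary_data <- data.frame(
  Experience = 1:10,
  Salary = 30000 + 2500 * (1:10) +
    c(500, -300, 200, -400, 100, -100, 300, -200, 0, -100)
)

salary_model <- lm(Salary ~ Experience, data = salary_data)

# The Experience coefficient estimates the salary change per additional year
coef(salary_model)

# Predict the salary of an employee with 12 years of experience
predict(salary_model, newdata = data.frame(Experience = 12))
```

The column names in `newdata` must match the predictor names used in the model formula.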

### Testing Hypotheses

* `t.test()` – Tests if two group means are significantly different.

* `chisq.test()` – Tests the relationship between categorical variables.

* `wilcox.test()` – Non-parametric version of t-test for skewed or small data.

These functions return a p-value, which is the probability of observing a result at least as extreme as yours if there were no real effect. A p-value under 0.05 is conventionally treated as statistically significant.
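As a brief illustration, the sketch below runs `t.test()` on two hypothetical groups of recovery times and extracts the p-value from the result.

```r
# Hypothetical recovery times (in days) for two treatment groups
group_a <- c(10, 12, 9, 11, 10, 13)
group_b <- c(15, 17, 14, 16, 18, 15)

# By default t.test() runs Welch's two-sample t-test, which does not assume equal variances
result <- t.test(group_a, group_b)

result$p.value          # probability of a difference this large if the true means were equal
result$p.value < 0.05   # TRUE here: the group means differ significantly
```

The object returned by `t.test()` also holds the test statistic (`result$statistic`) and a confidence interval for the difference in means (`result$conf.int`).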