## INTRO TO USING R FOR DATA EXPLORATION AND ANALYSIS
## TITANIC DATASET - https://www.kaggle.com/c/titanic/data
## JAGATH JAI KUMAR
# Getting started with R
# https://www.r-project.org/
# https://www.rstudio.com/
# https://www.tidyverse.org/
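# Run once, installed 4 lyfe (the check skips the install if it's already there)
if (!requireNamespace("tidyverse", quietly = TRUE)) install.packages("tidyverse")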
library(tidyverse)
# ggplot2
# readr
# dplyr
# tibble
# magrittr (where the %>% pipe comes from)
# Some conventions
we.name.things.like.this <- 10
we.assign.things.with.goofy.arrows <- 15
i.know.its.silly.but.dont.worry.about.it <- 12
i.know.its.silly.but.dont.worry.about.it
# Let's get started
# list the files in the current working directory
dir()
# change the working directory
setwd("~/Desktop/TitanicExploration")
titanic.data <- read_csv("train.csv")
head(titanic.data)
head(titanic.data, n = 10)
summary(titanic.data)
# take summary with a grain of salt
# Remember, the average human has 1 breast and 1 testicle
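# A quick sketch of why averages can mislead, on made-up numbers
# (toy.fare is hypothetical, not taken from train.csv):
toy.fare <- c(7, 8, 8, 9, 500)
mean(toy.fare)     # 106.4, dragged up by the one huge value
median(toy.fare)   # 8, much closer to what a "typical" value looks like
# So when summary() shows a Mean far from the Median, go look at the data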
str(titanic.data)
# Some notes
# int must be an integer, numerics can be decimal values
# chr is short for character string
# Notice that some of these values are kind of weird...
# Things like Pclass and Survived are better as categories, not numbers
# Be careful with values represented by numbers, they imply a value and sometimes a hierarchy
# For example, passenger class 3 is not necessarily 3x the value of passenger class 1
# To fix this, we can assign columns to be FACTORS (extremely useful!)
#Converting Survived to a factor
titanic.data$Survived <- factor(titanic.data$Survived)
#Converting Pclass to a factor
titanic.data$Pclass <- factor(titanic.data$Pclass)
#Converting Sex to a factor
titanic.data$Sex <- factor(titanic.data$Sex)
#Converting SibSp to a factor
titanic.data$SibSp <- factor(titanic.data$SibSp)
#Converting Parch to a factor
titanic.data$Parch <- factor(titanic.data$Parch)
#Converting Embarked to a factor
titanic.data$Embarked <- factor(titanic.data$Embarked, ordered = FALSE)
str(titanic.data)
# Much better!!
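# A small sketch of what the conversion buys us, on a made-up vector
# (toy.class is hypothetical, not part of the Titanic data):
toy.class <- c(1, 3, 3, 2, 3)
mean(toy.class)                    # R happily averages the class labels, which means nothing
toy.class.f <- factor(toy.class)
levels(toy.class.f)                # the distinct categories: "1" "2" "3"
table(toy.class.f)                 # counts per category, which is what we actually want
# mean(toy.class.f) would return NA with a warning, since a factor is not a number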
# There's a lot of cleaning to be done with this dataset!
# One example of this is the Name column
head(titanic.data$Name)
# Notice that women's actual names are in parens next to their husbands' names
# This is something that needs to be fixed, but string analysis is kinda weird
# So we can do that later if y'all are interested (hint: it involves regexes!)
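# If you're curious, here's a rough sketch of the regex idea in base R.
# toy.names is two names typed in by hand in the dataset's format, and the
# patterns are just one way to do it, assuming that "Last, Title. First (Own name)" layout
toy.names <- c("Braund, Mr. Owen Harris",
               "Cumings, Mrs. John Bradley (Florence Briggs Thayer)")
# Grab the title: whatever sits between ", " and the first "."
toy.titles <- sub("^.*, *([^.]+)\\..*$", "\\1", toy.names)
toy.titles
# Grab the name in parens if there is one, otherwise NA
toy.own.names <- ifelse(grepl("\\(", toy.names),
                        sub("^.*\\((.*)\\).*$", "\\1", toy.names),
                        NA_character_)
toy.own.names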
# Anyway let's see some pretty pictures
titanic.data %>%
  ggplot(aes(x = Pclass, fill = Survived)) +
  geom_bar()
# "From the graph, it is clear that the number of passengers who survived is
# independent of the class of passenger, while the number of passengers
# who did not survive seems to depend on the class of passenger."
titanic.data %>%
  ggplot(aes(x = Sex, fill = Survived)) +
  geom_bar(stat = "count", position = "fill")
# position = "fill" shows proportions: a much larger share of women survived than men
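# To put numbers on a plot like this, you can compute the share per group.
# A minimal sketch on made-up data (toy.passengers is hypothetical).
# Survived stays numeric here so mean() works; with the factor version
# you would use mean(Survived == 1) instead
library(dplyr)
toy.passengers <- tibble(
  Sex      = c("female", "female", "female", "male", "male", "male", "male"),
  Survived = c(1, 1, 0, 0, 0, 1, 0)
)
toy.rates <- toy.passengers %>%
  group_by(Sex) %>%
  summarise(n = n(), survival.rate = mean(Survived))
toy.rates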
titanic.data %>%
  ggplot(aes(x = Age, fill = Survived)) +
  geom_histogram()
# neat
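# By default geom_histogram() uses 30 bins and warns about rows with missing values.
# A sketch of setting both yourself, on made-up ages (toy.ages is hypothetical):
library(ggplot2)
toy.ages <- data.frame(Age      = c(22, 38, 26, NA, 35, 54, 2, 27, NA, 14),
                       Survived = factor(c(0, 1, 1, 0, 0, 0, 0, 1, 1, 1)))
toy.plot <- ggplot(toy.ages, aes(x = Age, fill = Survived)) +
  geom_histogram(binwidth = 10, na.rm = TRUE)   # 10-year bins; na.rm drops the NAs quietly
toy.plot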
# I'll wrap up here by going over some cool functions you can use for
# exploration & analysis
n_distinct(titanic.data$Pclass)
unique(titanic.data$Cabin)
# dplyr is a package with some very cool tools for wrangling
# select, filter, mutate, arrange, group_by
titanic.data %>% select(PassengerId, Survived)
titanic.data %>% select(PassengerId, Survived) %>% filter(Survived == 1)
# dplyr is my favorite package because it's awesome but we will go in depth
# another time, for now you just want to know that it's the go-to
# for wrangling
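# A small taste of the other verbs, on made-up data (toy.fares is hypothetical).
# summarise() is not in the list above but usually travels with group_by()
library(dplyr)
toy.fares <- tibble(
  Pclass = c(1, 3, 2, 3, 1),
  Fare   = c(80, 7.5, 13, 8, 52)
)
toy.fares %>%
  mutate(Fare.rounded = round(Fare)) %>%   # mutate adds a new column
  arrange(desc(Fare))                      # arrange sorts rows, here highest fare first
toy.by.class <- toy.fares %>%
  group_by(Pclass) %>%                     # group_by splits the rows into groups...
  summarise(mean.fare = mean(Fare))        # ...and summarise collapses each group to one row
toy.by.class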
###################################
# I'll stop here, now it's up to you to ask interesting questions and
# make your own observations!