# Part (a): Exploring Data with R
### UE22CS342AA2 - Data Analytics






## Prerequisites

This worksheet aims to develop your understanding of summary statistics and basic visualizations through a pragmatic approach. 

## Resources
- Check out [this](https://intro2r.com/) beautifully comprehensive resource for everything you need to get started with R.
- [This online book](https://r-graphics.org/) provides guided explanations of visualizations in R using the ggplot2 library.


## About the Dataset

- To make this worksheet interesting, we picked a Kaggle dataset of movies and their associated metadata, collected from The Movie Database (TMDB). It is a subset of this [Kaggle dataset](https://www.kaggle.com/datasets/tmdb/tmdb-movie-metadata).


- `title` - Name or Title of the movie.
- `budget` - The budget of the movie in US dollars (USD).
- `genres` - The genres of the movie.
- `id` - The identifier for the movie in The Movie Database (TMDB).
- `original_language` - The language associated with the original version of the film.
- `popularity` - Lifetime popularity score of a movie that is impacted by attributes like number of votes, number of views, etc.
- `release_date` - The release date of the movie.
- `revenue` - The revenue generated by the movie in US dollars (USD).
- `runtime` - The duration of the movie in minutes.
- `vote_average` - The average of all the votes on the scale of 10.
- `vote_count` - The number of votes for a movie.
- `director` - The director of the movie.


In [None]:
# Read the data from the CSV File.
data <- read.csv("/kaggle/input/movie-dataset/movie_dataset.csv")
head(data)

## Preliminary Guided Exercises
- Make sure you have the R programming language installed on your system. It is also recommended to install RStudio, the popular IDE for R. Click [here](https://www.youtube.com/watch?v=H9EBlFDGG4k) for Windows installation and [here](https://www.youtube.com/watch?v=I5WIMX4LK8M) for macOS.
- RStudio provides a lot of useful functionality like R Markdown, a script editor and GitHub integration.
- RStudio Projects are a great way of keeping each week’s assignment work organized.
- This assignment introduces you to R and RStudio. Following this, all worksheets will be exclusively on Kaggle.

## Data Import

In [None]:
# Kaggle
data <- read.csv('/kaggle/input/movie-dataset/movie_dataset.csv', header=TRUE)

# Else - place the path to the dataset.
# data <- read.csv('movie_dataset.csv', header=TRUE)


- The `header = TRUE` argument specifies that the first row of your data contains the variable names. This is the default for `read.csv()`, so you can omit the argument entirely; if the first row is data rather than names, specify `header = FALSE`.
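
A minimal sketch of how `header` changes the result. It uses a tiny inline CSV passed through the `text` argument of `read.csv()` instead of the worksheet's file; the two movie rows are made up for illustration.

```r
# Tiny inline CSV; the rows are illustrative, not taken from the dataset.
csv_text <- "title,budget\nMovie A,100\nMovie B,250"

# header = TRUE: the first row becomes the column names.
with_header <- read.csv(text = csv_text, header = TRUE)
names(with_header)     # "title" "budget"
nrow(with_header)      # 2

# header = FALSE: the first row is treated as data; columns are named V1, V2.
without_header <- read.csv(text = csv_text, header = FALSE)
names(without_header)  # "V1" "V2"
nrow(without_header)   # 3
```

With `header = FALSE` the word `budget` lands in the data, so that column is read as text rather than numbers.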


## Compact Summary
- Use the `str()` function to return a compact and informative summary of the DataFrame.

In [None]:
str(data)

- Here we see that `data` is a `data.frame` object containing 4041 rows and 12 variables (columns). Each variable is listed along with its data class and its first few values.
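
To see what `str()` reports without loading the full dataset, here is a small sketch on a toy data frame. The name `toy` and its values are hypothetical stand-ins for the movie data.

```r
# Toy data frame standing in for the movie data (values are illustrative).
toy <- data.frame(
  title = c("Movie A", "Movie B", "Movie C"),
  budget = c(100, 250, 75),
  vote_count = c(10L, 42L, 7L),
  stringsAsFactors = FALSE
)

str(toy)            # one line per column: name, class, first few values
dim(toy)            # number of rows, then number of columns
sapply(toy, class)  # class of each column as a named character vector
```

`dim()` and `sapply(toy, class)` return the same facts as `str()` in a form that can be stored or compared in code.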

## Summary Statistics
- To access the data in any of the variables (columns) in our data frame we can use the $ notation. Indexing in R starts at 1, which means the first element is at index 1. Access the first 10 values of the title column:

In [None]:
data$title[1:10]

- We can assign a column to another variable and calculate a mean of a numeric variable or get a summary of a variable using the `summary()` function.


*Problem*: Can you find the summary statistics of the `vote_count` and `title` columns?

In [None]:
# Your answer here.

In [None]:
summary(data$vote_count)

In [None]:
summary(data$title)

- Notice how the behavior of the summary function changes with different types of variables. Let’s now try to explore how we can visualize our data!
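
The difference can be seen side by side on small made-up vectors. This is only a sketch: `votes`, `titles` and `langs` are hypothetical names, not columns read from the dataset.

```r
votes  <- c(10, 42, 7, 300, 42)                # numeric
titles <- c("Movie A", "Movie B", "Movie C")   # character
langs  <- factor(c("en", "fr", "en"))          # factor

mean(votes)      # arithmetic mean of a numeric vector
summary(votes)   # minimum, quartiles, median, mean and maximum
summary(titles)  # only length, class and mode for character data
summary(langs)   # number of observations per factor level
```

Numeric input gets a five-number summary plus the mean, while character input only reports its length and type. Converting a text column with `factor()` gives per-level counts instead.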


## Scatter Plots and Line Plots
- Plotting and visualisation are an essential part of data analysis!
- The most common high level function used to produce plots in R is the `plot` function.

In [None]:
# par(mfrow = c(1,2)) # To plot different plots in the same row
plot(data$budget, type="p") # scatter plot of budget vs the index

*Problem*: Can you show a line plot between the revenue and the index?

In [None]:
# Your answer Here.
plot(data$revenue, type="l") # line plot
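
`plot()` also accepts two vectors, in which case the first goes on the x-axis and the second on the y-axis. The sketch below uses made-up `budget` and `revenue` vectors (not the dataset) and the `par(mfrow = ...)` layout mentioned in the comment above to draw both kinds of plot side by side.

```r
# Illustrative values only; replace with data$budget and data$revenue.
budget  <- c(100, 250, 75, 400, 180)
revenue <- c(300, 500, 60, 1200, 350)

par(mfrow = c(1, 2))                                # two panels in one row
plot(budget, type = "p", main = "Budget vs index")  # y against 1..n
plot(budget, revenue, type = "p",
     xlab = "Budget", ylab = "Revenue",
     main = "Revenue vs budget")                    # y against x
par(mfrow = c(1, 1))                                # restore single-panel layout
```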

## Sorting a data frame
- To sort a data frame with respect to a column, we can use the `order()` function. Let us sort the data frame to get the top 10 highest-grossing movies.


In [None]:
sorted_data <- data[order(data$revenue, decreasing = TRUE), ] # To sort in descending order 

# The head function is used to get the first 10 rows
top_10_rows <- head(sorted_data, n = 10)

top_10_rows
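
`order()` does not return sorted values; it returns the row positions that would put the column in order, which is why it is used inside `[ , ]`. A small sketch on made-up vectors:

```r
# Illustrative values only.
revenue <- c(300, 1200, 60, 500)
titles  <- c("Movie A", "Movie B", "Movie C", "Movie D")

idx <- order(revenue, decreasing = TRUE)
idx                               # 2 4 1 3: positions, largest revenue first
titles[idx]                       # "Movie B" "Movie D" "Movie A" "Movie C"
sort(revenue, decreasing = TRUE)  # 1200 500 300 60: the values themselves
```

Use `sort()` when you only need the ordered values of one vector, and `order()` when other columns must be rearranged to match.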

*Problem*: Can you sort the `vote_average` column in descending order?

In [None]:
# your answer here
sorted_df <- data[order(data$vote_average, decreasing = TRUE),  ]
head(sorted_df)

## Column Transformation
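- The highest revenue might not be the right indicator of a successful movie, so let's compute the ROI (Return on Investment) for every movie.

- `ROI = Net Return / Cost of Investment`, where the net return is `revenue - budget` and the cost of investment is `budget`.

In [None]:
# Net return (revenue - budget) divided by the cost of investment (budget).
# Rows with a budget of 0 produce Inf or NaN.
data$ROI <- (data$revenue - data$budget) / data$budget
# Print the first 5 rows with their title and ROI
data[1:5, c("title", "ROI")]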
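
Any ratio with `budget` in the denominator becomes `Inf` or `NaN` where the budget is 0. Below is a minimal sketch, on made-up values, that applies the net-return formula stated above and marks such rows as `NA`. Treating a zero budget as a missing value is an assumption about the data, not something the worksheet states, and `toy` and `safe_budget` are hypothetical names.

```r
# Illustrative values only; "Movie B" has a recorded budget of 0.
toy <- data.frame(
  title   = c("Movie A", "Movie B", "Movie C"),
  budget  = c(100, 0, 50),
  revenue = c(300, 40, 25)
)

# Assumption: a budget of 0 means "unknown", so replace it with NA.
safe_budget <- ifelse(toy$budget > 0, toy$budget, NA)

# ROI = net return / cost of investment
toy$ROI <- (toy$revenue - toy$budget) / safe_budget
toy$ROI   # 2.0  NA -0.5
```

A negative ROI, as for "Movie C", means the movie earned less than it cost.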

## Data Pre-processing
- Real-world datasets are often not curated or cleaned. Values are not always stored in a suitable format, so the data requires cleaning and transformation before it is ready for analysis. In our case, the genres of each movie are stored as a single string. Let us split the string to get all the genre labels.

In [None]:
# Convert the space-separated string of genres to a list of genres for each movie
data$genres <- strsplit(data$genres, " ")

# Extract the individual genres and count their occurrences
label_counts <- table(unlist(data$genres)) # label_counts is a table object: one count per genre label
# Sort the counts in descending order
label_counts <- sort(label_counts, decreasing = TRUE)


In [None]:
label_counts
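
The split-then-count pipeline is easier to follow on three made-up genre strings. This is a sketch under the same assumption as the cell above, namely that genres are separated by single spaces; `genres`, `split_genres` and `counts` are hypothetical names.

```r
# Illustrative genre strings, one per movie.
genres <- c("Action Adventure", "Drama", "Drama Romance")

split_genres <- strsplit(genres, " ")  # list: one character vector per movie
lengths(split_genres)                  # 2 1 2: number of genres per movie

# Flatten the list, count each label, then sort by frequency.
counts <- sort(table(unlist(split_genres)), decreasing = TRUE)
counts
names(counts)[1]                       # "Drama", the most frequent label
```

`unlist()` is what turns the per-movie list into one long vector that `table()` can count.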

## Using the ggplot2 Library

In [None]:
#Load the ggplot2 library
library(ggplot2)

# as.data.frame() converts the table object to a data frame
# with "Var1" (genre) and "Freq" (count) columns
label_counts_df <- as.data.frame(label_counts)

# Plot the bar chart using ggplot2
ggplot(label_counts_df, aes(x = Var1, y = Freq)) +
  geom_bar(stat = "identity", fill = "skyblue") +
  labs(x = "Genre", y = "Number of Movies", title = "Number of Movies for Each Genre") +
  theme(axis.text.x = element_text(angle = 45, hjust = 1)) # rotate the genre labels

In [None]:
# Showing the top 5 frequencies (head() returns 6 rows by default)
head(label_counts_df, n = 5)

In [None]:
nrow(data)

- It's evident that close to 50% of the movies carry the `Drama` genre label.
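
One way to check such a claim is to divide each genre count by the number of movies. The sketch below does this on four made-up genre strings; in the worksheet the equivalent expression would be `label_counts / nrow(data)`. The names `genres`, `counts` and `share` are hypothetical.

```r
# Illustrative genre strings, one per movie.
genres <- c("Action Adventure", "Drama", "Drama Romance", "Comedy")

counts <- table(unlist(strsplit(genres, " ")))
share  <- counts / length(genres)  # fraction of movies carrying each genre

round(100 * share, 1)              # percentages per genre
share["Drama"]                     # 0.5: half of the toy movies are dramas
sum(share)                         # exceeds 1: a movie can have several genres
```

Because one movie can carry several labels, these shares are not parts of a whole and should not be expected to add up to 100%.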

*fin*