# All in One for the Exam

Time flies, and the exam is coming. To help you review the course, we have prepared this notebook, which collects the tutorial and lab materials we have covered. We hope it helps you prepare for the exam.

This exam review document __covers only the materials from the tutorials and labs, not the
materials you studied in the lectures__. You should also review the lecture materials.

Here is the list of topics that we have covered and that will appear in the exam:

1. introduction to data.table
2. using data.table to manipulate data
3. basic data visualization
4. introduction to linear regression
5. introduction to logistic regression

## 1. Introduction to data.table

Broadly speaking, there are two kinds of data: __structured data__ and __unstructured data__. 
Structured data has a fixed organization, such as a table with rows and columns, whereas unstructured data has no predefined organization, such as the free text in a text file. In this course, we focus on structured data. This means all the data we will use look like tables, such 
as the following one:

![data.table-example](../drawio/R-data-table-illustration.png)
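To make the distinction concrete, here is a minimal sketch in R. The table `firms`, its column names, and all values are made up for illustration; they are not taken from the course data.

```r
library(data.table)

# Structured: each row is one observation, each column is one variable
# (hypothetical values, for illustration only)
firms <- data.table(
  firm_id   = c(1L, 2L, 3L),
  region    = c("west", "ost", "west"),
  employees = c(120, 35, 800)
)

# Unstructured: free text with no fixed rows or columns
note <- "Firm 1 is in the west and has about 120 employees."

dim(firms)    # number of rows and columns
names(firms)  # variables can be addressed by name
nchar(note)   # for raw text, only generic facts such as its length are immediate
```

The table can be filtered, grouped, and summarized column by column, whereas the text would first have to be parsed before any such analysis is possible.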

> Small story (will not be tested in the exam): I had a talk with a principal data scientist who works for the government. He told me that the government is implementing a strategy called "AI in 2030". The goal of this strategy is to make the government more data-driven and AI-driven. He told me that every year they pay a lot of money to consultancies such as Boston Consulting Group (BCG) to do data analysis for them. He said that the government is now planning to combine data scientists and ChatGPT to do the data analysis, hoping to reduce the cost by 40%. The idea is that they will hire BCG only for very complex data analysis tasks and use AI for the simple ones. The main tools that BCG uses are Excel, SQL and Tableau. They are all table-based tools. This means that a good understanding of table-based data analysis is very important. This is why we start with data.table.

The basic syntax of data.table is summarized in the following illustration. __You will
not be tested on the syntax of data.table in the exam__. However, you will be tested on the
underlying concepts, such as the types of variables (integer, character, factor, etc.).
If you later work as a data scientist, you can use data.table for big
data analysis; you will then need its syntax for practical work, not for the exam.

![data.table-syntax](../drawio/R-data-table-illustration2.png)
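As a hedged illustration of the variable types mentioned above, the following sketch builds a toy table and inspects the type of each column. The table `dt`, its column names, and its values are hypothetical. The last statement uses the general data.table form `dt[i, j, by]` (filter rows, compute on columns, group) only to show how the pieces fit together.

```r
library(data.table)

# Toy table with made-up values (not CIS data)
dt <- data.table(
  id     = c(1L, 2L, 3L, 4L),              # integer
  sector = c("A", "B", "A", "B"),          # character
  sales  = c(10.5, 20.0, 7.25, 12.25),     # double (numeric)
  east   = c(TRUE, FALSE, FALSE, TRUE)     # logical
)

# Convert a character column into a factor (a categorical variable)
dt[, sector_f := factor(sector)]

# Type of every column
sapply(dt, class)

# General form dt[i, j, by]:
# i = keep rows with sales > 8, j = mean of sales, by = one result per sector
avg_sales <- dt[sales > 8, .(mean_sales = mean(sales)), by = sector]
avg_sales
```

A factor stores a fixed set of categories, which is why it is the usual type for grouping variables in regression models, while a character column is just text.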

### 1.1 data.table Lab

Now, we will use data.table to do some data analysis on the `Community Innovation Survey` (CIS).
The CIS is a survey conducted by the European Union (EU) to collect data
about the innovation activities of firms. The survey we will use is the 2021 CIS from Germany.

In [3]:
# library for data analysis
library(data.table)
library(magrittr)
library(ggplot2)
library(knitr)
# install stargazer (only needed once per environment)
install.packages("stargazer")
# load stargazer, MASS and ISLR
library(stargazer)
library(MASS)
library(ISLR)

In [4]:
# read data
cis <- fread("https://raw.githubusercontent.com/oceanumeric/data-science-go-small/main/data/innovation_survey/extmidp21.csv")

In [7]:
# check the dimensions: 5083 rows and 284 columns
dim(cis)

In [8]:
# take a look at the first 6 rows (the default for head())
head(cis)

| id | branche | bran_4 | filter | ost | ustaat | gb | bges | gk3n | bges18 | ... | mkosts | mkosts19 | wbp | wbp19 | wbpx | wbp19x | invs | invs19 | invsx | invs19x |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| `<int>` | `<chr>` | `<chr>` | `<chr>` | `<chr>` | `<chr>` | `<chr>` | `<dbl>` | `<chr>` | `<dbl>` | ... | `<chr>` | `<chr>` | `<dbl>` | `<dbl>` | `<chr>` | `<chr>` | `<dbl>` | `<dbl>` | `<chr>` | `<chr>` |
| 300127 | Elektroindustrie | Forschungsintensive Industrie | nein | ost | | Bereich | 38.401396 | 50-249Besch | 38.401398 | ... | | | | | | | | | | |
| 301003 | Metallerzeugung/-bearbeitung | Sonstige Industrie | ja | ost | | Bereich | 4.046923 | <50Besch | 5.058653 | ... | .5<=x<.7 | .5<=x<.7 | 0.0 | 0.0 | keine Stutzung | keine Stutzung | | | | |
| 301078 | Maschinenbau | Forschungsintensive Industrie | nein | west | | Bereich | 497.850854 | >=250Besch | | ... | .4<=x<.5 | .4<=x<.5 | 0.007223942 | 0.01153213 | keine Stutzung | keine Stutzung | 0.044347249 | 0.06277719 | keine Stutzung | keine Stutzung |
| 301084 | Energie/Bergbau/Mineraloel | Sonstige Industrie | ja | west | | Bereich | 311.483458 | 50-249Besch | 290.13177 | ... | x>=.7 | x>=.7 | 0.031338606 | 0.03232491 | keine Stutzung | keine Stutzung | 0.002553067 | 0.00315247 | keine Stutzung | keine Stutzung |
| 301189 | Energie/Bergbau/Mineraloel | Sonstige Industrie | nein | west | | Bereich | 751.191355 | >=250Besch | | ... | x>=.7 | x>=.7 | 0.008867039 | 0.01385574 | keine Stutzung | keine Stutzung | 0.15335332 | 0.0940136 | keine Stutzung | keine Stutzung |
| 301282 | Elektroindustrie | Forschungsintensive Industrie | nein | west | | Bereich | 169.861436 | 50-249Besch | 169.86143 | ... | | | | | | | | | | |
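
The tags under the column names in the output above (`<int>`, `<chr>`, `<dbl>`) are the column types. The sketch below shows one way to inspect a table after loading it, using a tiny inline CSV so that it runs without downloading anything. Only the column names are borrowed from the CIS output; the values in `csv_text` are made up, and `toy` is a hypothetical stand-in for `cis`.

```r
library(data.table)

# Tiny inline CSV with made-up values; the last row has a blank numeric field
csv_text <- "id,branche,ost,bges
1,Maschinenbau,west,497.9
2,Elektroindustrie,ost,38.4
3,Elektroindustrie,west,
"
toy <- fread(text = csv_text)

dim(toy)                   # number of rows and columns
sapply(toy, class)         # one type per column
toy[, .N, by = branche]    # number of firms per sector
sum(is.na(toy$bges))       # count of missing values in a numeric column
```

The same calls can be applied to `cis` itself. Checking types and missing values first is a common habit before any plotting or regression.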
