---

# Table of contents

1. Comma Separated Values (CSV)

2. MS Excel (Open XML / XLSX)

3. eXtensible Markup Language (XML)

4. JavaScript Object Notation (JSON) and API

5. Google Services

    5.1. Spreadsheets    
    5.2. Trends
    
6. SQL (SQLite sample)

7. Web-pages (HTML)

---

Data can be stored in and read from many types of sources. Let's review and try some of them.

# 1. CSV


`CSV` - comma separated values.

In [None]:
# check the current working directory to build correct file paths
getwd()

You can use `/` or `\\` as the path separator in R. For example:

In [None]:
path = "d:/projects/file.csv"
path = "d:\\projects\\file.csv"

To combine parts of a path, use the `paste()` or `paste0()` functions:

In [None]:
work_dir = getwd()
work_dir 

In [None]:
file_name = "temp_file.csv"
file_path = paste0(work_dir, "/", file_name)
file_path

In [None]:
file_path = paste(work_dir, file_name, sep = "/")
file_path
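
Base R also has `file.path()`, which is built for joining path components. The sketch below reuses the example directory `d:/projects` from the cell above; the directory does not have to exist for the path string to be built.

```r
# file.path() joins path components with "/" on every platform
file_path <- file.path("d:/projects", "file.csv")
file_path

# basename() and dirname() split a path back into its parts
file_name <- basename(file_path)  # the file name only
dir_name  <- dirname(file_path)   # everything before the file name
file_name
dir_name
```

Compared with `paste0()`, this avoids typing the separator by hand.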

#### Sample dataset description

Information about the dataset comes from [kaggle.com](https://www.kaggle.com).
The original file is located at: [https://www.kaggle.com/radmirzosimov/telecom-users-dataset](https://www.kaggle.com/radmirzosimov/telecom-users-dataset).

Any business wants to maximize the number of customers. To achieve this goal, it is important not only to try to attract new ones, but also to retain existing ones. Retaining a client will cost the company less than attracting a new one. In addition, a new client may be weakly interested in business services and it will be difficult to work with him, while old clients already have the necessary data on interaction with the service.

Accordingly, predicting the churn, we can react in time and try to keep the client who wants to leave. Based on the data about the services that the client uses, we can make him a special offer, trying to change his decision to leave the operator. This will make the task of retention easier to implement than the task of attracting new users, about which we do not know anything yet.

You are provided with a dataset from a telecommunications company. The data contains information about almost six thousand users, their demographic characteristics, the services they use, the duration of using the operator's services, the method of payment, and the amount of payment.

The task is to analyze the data and predict the churn of users (to identify people who will and will not renew their contract). The work should include the following mandatory items:

1. Description of the data (with the calculation of basic statistics);
2. Research of dependencies and formulation of hypotheses;
3. Building models for predicting the outflow (with justification for the choice of a particular model) based on tested hypotheses and identified relationships;
4. Comparison of the quality of the obtained models.

**Fields description:**

- [x] `customerID` - customer id
- [x] `gender` - client gender (male / female)
- [x] `SeniorCitizen` - is the client retired (1, 0)
- [x] `Partner` - is the client married (Yes, No)
- [x] `tenure` - how many months a person has been a client of the company
- [x] `PhoneService` - is the telephone service connected (Yes, No)
- [x] `MultipleLines` - are multiple phone lines connected (Yes, No, No phone service)
- [x] `InternetService` - client's Internet service provider (DSL, Fiber optic, No)
- [x] `OnlineSecurity` - is the online security service connected (Yes, No, No internet service)
- [x] `OnlineBackup` - is the online backup service activated (Yes, No, No internet service)
- [x] `DeviceProtection` - does the client have equipment insurance (Yes, No, No internet service)
- [x] `TechSupport` - is the technical support service connected (Yes, No, No internet service)
- [x] `StreamingTV` - is the streaming TV service connected (Yes, No, No internet service)
- [x] `StreamingMovies` - is the streaming cinema service activated (Yes, No, No internet service)
- [x] `Contract` - type of customer contract (Month-to-month, One year, Two year)
- [x] `PaperlessBilling` - whether the client uses paperless billing (Yes, No)
- [x] `PaymentMethod` - payment method (Electronic check, Mailed check, Bank transfer (automatic), Credit card (automatic))
- [x] `MonthlyCharges` - current monthly payment
- [x] `TotalCharges` - the total amount that the client paid for the services for the entire time
- [x] `Churn` - whether there was a churn (Yes or No)


There are a few functions for reading/writing csv in the `base` package:

- [x] `read.csv()`, `write.csv()` - default data separator is `,`, decimal separator is `.`.
- [x] `read.csv2()`, `write.csv2()` - default data separator is `;`, decimal separator is `,`.
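
The difference between the two conventions is easiest to see on a small round trip. The sketch below writes a toy data frame (made-up values, not part of the telecom dataset) to a temporary file with `write.csv2()` and reads it back with `read.csv2()`.

```r
# toy data frame with a decimal column (values are made up for illustration)
toy <- data.frame(id = 1:3, charge = c(10.5, 20.25, 30))

tmp <- tempfile(fileext = ".csv")
write.csv2(toy, tmp, row.names = FALSE)  # ";" between fields, "," as decimal mark
readLines(tmp)                           # inspect the raw text of the file

toy_back <- read.csv2(tmp)               # same conventions when reading
str(toy_back)

unlink(tmp)                              # remove the temporary file
```

Reading the same file with `read.csv()` would not split the fields correctly, because it expects `,` as the field separator.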

Before using any new function, check its usage information with `help(function_name)` or `?function_name`, for example: **`?read.csv`**.

You can read the data as follows (NA values were added to this copy of the data set as an example; there are no NA values in the original dataset):

In [None]:
data <- read.csv("data/telecom_users.csv") # default reading

In [None]:
data <- read.csv("data/telecom_users.csv",
                  sep = ",", # comma is not the only possible separator
                  dec = ".", # decimal separator can be different
                  na.strings = c("", "NA", "NULL")) # you can define NA values

In [None]:
str(data) # check data structure / types / values

In [None]:
head(data, 2) # first 2 rows; head() shows 6 rows by default, use n = X to view the first X rows

In [None]:
is.data.frame(data) # TRUE if data is a data.frame

In [None]:
anyNA(data) # TRUE if the data frame contains any NA values

In [None]:
lapply(data, anyNA)
# lapply() goes over the 2nd dimension (columns) and checks each one for NA

Check `MonthlyCharges: TRUE` and `TotalCharges: TRUE`. These columns have NA values.

Let's replace them with the column `mean`:

In [None]:
data[is.na(data$TotalCharges), "TotalCharges"] <- mean(data$TotalCharges, na.rm = TRUE)
data[is.na(data$MonthlyCharges), "MonthlyCharges"] <- mean(data$MonthlyCharges, na.rm = TRUE)

In [None]:
any(is.na(data)) # check for NA
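
When many columns have gaps, the same replacement can be done in a loop. This is a minimal sketch on a toy data frame (made-up values); it fills only numeric columns and leaves text columns untouched.

```r
# toy data frame with gaps (values are made up for illustration)
toy <- data.frame(
  MonthlyCharges = c(20, NA, 40),
  TotalCharges   = c(100, 300, NA),
  Churn          = c("Yes", "No", NA)
)

# replace NA with the column mean, but only in numeric columns
num_cols <- names(toy)[sapply(toy, is.numeric)]
for (col in num_cols) {
  missing <- is.na(toy[[col]])
  toy[missing, col] <- mean(toy[[col]], na.rm = TRUE)
}

colSums(is.na(toy))  # remaining NA count per column
```

The text column `Churn` still has its NA, since a mean is not defined for it.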

You can write data with `write.csv()`, `write.csv2()` from `base` package.

In [None]:
write.csv(data, file = "data/cleaned_data.csv", row.names = FALSE)
# by default row.names = TRUE and file will contain first column with row numbers 1,2, ..., N

Another useful package is `readr`. Usage examples:

In [None]:
#install.packages("readr")
library(readr)
data <- read_csv(file = "data/telecom_users.csv", … )
data <- read_csv2(file = "data/telecom_users.csv", … )

---

# 2. Excel (xlsx)

There are many packages for reading/writing MS Excel files. `xlsx` is one of the most useful.

In [None]:
install.packages("xlsx") # install the package before the first use

In [None]:
library(xlsx)

In [None]:
any(grepl("xlsx", installed.packages())) # check if package installed

**`?read.xlsx`** - review the function description and parameters

Let's read the data `telecom_users.xlsx`: 

In [None]:
data <- read.xlsx("data/telecom_users.xlsx", sheetIndex = 1)
# sheetIndex = 1 - select the sheet to read, or use sheetName = "sheet1" to read by name

In [None]:
# You can also use startRow, endRow and other params to define how much data to read
data <- read.xlsx("data/telecom_users.xlsx", sheetIndex = 1, endRow = 100)

Let's replace `Churn` values `Yes`/`No` by `1`/`0`: 

In [None]:
head(data$Churn)

In [None]:
data$Churn <- ifelse(data$Churn == "Yes", 1, 0)

In [None]:
head(data$Churn)

Write the final data to Excel:

In [None]:
write.xlsx(data, file = "data/final_telecom_data.xlsx")

## Task 2.1

Download the dataset `Default_Fin.csv` from kaggle.com and read it: 
https://www.kaggle.com/kmldas/loan-default-prediction

_Description:_

This is a synthetic dataset created using actual data from a financial institution. The data has been modified to remove identifiable features and the numbers transformed to ensure they do not link to original source (financial institution).

This is intended to be used for academic purposes for beginners who want to practice financial analytics from a simple financial dataset

- [x] `Index` - This is the serial number or unique identifier of the loan taker
- [x] `Employed`     - This is a Boolean 1= employed 0= unemployed 
- [x] `Bank.Balance` - Bank Balance of the loan taker
- [x] `Annual.Salary` - Annual salary of the loan taker  
- [x] `Defaulted` - This is a Boolean 1= defaulted 0= not defaulted

In [None]:
1. Check which columns have missing values
2. Count default and non-default clients / and parts of total clients in %
3. Count Employed clients
4. Count Employed Default clients
5. Average salary by Employed clients
6. Rename columns to "id", "empl", "balance", "salary", "default"

---

## Solution for Task 2.1

In [None]:
data <- read.csv("data/Default_Fin.csv")
head(data)

> 1. Check which columns have missing values

In [None]:
sapply(data, anyNA) # TRUE marks a column with missing values

> 2. Count default and non-default clients / and parts of total clients in %

In [None]:
def_count <- nrow(data[data$Defaulted. == 1, ])
no_def_count <- nrow(data[data$Defaulted. == 0, ])
def_count
no_def_count 

In [None]:
def_count / nrow(data) * 100 # part defaults
no_def_count / nrow(data) * 100 # part non-defaults

> 3. Count Employed clients

In [None]:
empl <- data[data$Employed == 1, ]
nrow(empl)

> 4. Count Employed Default clients

In [None]:
empl <- data[data$Employed == 1 & data$Defaulted. == 1, ]
nrow(empl)

> 5. Average salary by Employed clients

In [None]:
empl <- data[data$Employed == 1, ]
mean(empl$Annual.Salary)

> 6. Rename columns to "id", "empl", "balance", "salary", "default":

In [None]:
colnames(data) <- c("id", "empl", "balance", "salary", "default")
head(data)
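
Counts and percentages like those in steps 2 to 4 can also be obtained with `table()` and `prop.table()`. The sketch below uses a toy stand-in with made-up values and plain column names `Employed` and `Defaulted`; in the real file the default column is read as `Defaulted.`.

```r
# toy stand-in for Default_Fin.csv (values are made up for illustration)
toy <- data.frame(
  Employed  = c(1, 1, 0, 1, 0),
  Defaulted = c(0, 1, 0, 0, 1)
)

counts <- table(toy$Defaulted)      # clients per class
shares <- prop.table(counts) * 100  # the same as percentages of the total
counts
shares

# cross-tabulation: employment status against default status
table(Employed = toy$Employed, Defaulted = toy$Defaulted)
```

The cross-tabulation answers "employed and defaulted" in one call instead of a separate filter per combination.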

# 3. XML

`XML` - eXtensible Markup Language.

For our example we will use data from `data/employes.xml`. The file contains employee records in the following form:

```
<RECORDS>
   <EMPLOYEE>
      <ID>1</ID>
      <NAME>Rick</NAME>
      <SALARY>623.3</SALARY>
      <STARTDATE>1/1/2012</STARTDATE>
      <DEPT>IT</DEPT>
   </EMPLOYEE>
   ...
</RECORDS>
```

In [None]:
#install.packages("XML")
library("XML")
#install.packages("methods")
library("methods")

In [None]:
result <- xmlParse(file = "data/employes.xml")
print(result)

In [None]:
rootnode <- xmlRoot(result) # get the root node of the xml document
rootnode[[1]] # first record

In [None]:
rootnode[[1]][[2]] # second tag of the first record in the root node, i.e. <NAME>

For our purposes, the most convenient way is to get a data frame:

In [None]:
xmldataframe <- xmlToDataFrame("data/employes.xml")
xmldataframe
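
Two follow-up steps are often needed: selecting individual tags and converting text to numbers. The sketch below parses a small inline document with the same layout as `data/employes.xml` (the second record is made up), selects names with an XPath expression, and converts `SALARY` to numeric, assuming the values arrive as text.

```r
library(XML)

# inline document with the same layout as data/employes.xml;
# the second record is made up for illustration
xml_text <- "<RECORDS>
  <EMPLOYEE><ID>1</ID><NAME>Rick</NAME><SALARY>623.3</SALARY></EMPLOYEE>
  <EMPLOYEE><ID>2</ID><NAME>Dan</NAME><SALARY>515.2</SALARY></EMPLOYEE>
</RECORDS>"

doc <- xmlParse(xml_text, asText = TRUE)

# XPath: collect the text of every <NAME> tag inside <EMPLOYEE>
employee_names <- xpathSApply(doc, "//EMPLOYEE/NAME", xmlValue)
employee_names

# convert the salary column to numbers explicitly
employees <- xmlToDataFrame(doc, stringsAsFactors = FALSE)
employees$SALARY <- as.numeric(employees$SALARY)
str(employees)
```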

# 4. JSON and API

`JSON` (`JavaScript Object Notation`) is a lightweight data-interchange format. It is easy for humans to read and write. It is easy for machines to parse and generate. It is based on a subset of the JavaScript Programming Language Standard.

`API` is the acronym for `Application Programming Interface`, which is a software intermediary that allows two applications to talk to each other. 

One of the most popular packages for `json` is `jsonlite`.

In [1]:
#install.packages("jsonlite")
library(jsonlite)
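
Before calling a real API, it may help to see how `jsonlite` maps JSON text to R objects. This is a small offline sketch; the JSON string and its values are made up for illustration.

```r
library(jsonlite)

# JSON text: an array of objects (values are made up for illustration)
json_text <- '[
  {"symbol": "BTCUSDT", "price": "21770.37"},
  {"symbol": "ETHUSDT", "price": "1530.10"}
]'

prices <- fromJSON(json_text)             # array of objects -> data.frame
str(prices)

prices$price <- as.numeric(prices$price)  # quoted numbers arrive as text
toJSON(prices)                            # and back to JSON text
```

Numbers wrapped in quotes are read as character values, so an explicit conversion is needed before doing arithmetic.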

Let's read information about the BTC/USDT cryptocurrency pair from Binance.

In [2]:
market = 'BTCUSDT'
interval = '1h'
limit = 100

url <- paste0("https://api.binance.com/api/v3/klines?symbol=", market, "&interval=", interval, "&limit=", limit)
print(url) # complete request URL

[1] "https://api.binance.com/api/v3/klines?symbol=BTCUSDT&interval=1h&limit=100"


The next step is to use the `fromJSON()` function to get the data.

More details about requests to Binance are at https://github.com/binance/binance-spot-api-docs/blob/master/rest-api.md#klinecandlestick-data

If you open the `url` value in a browser, the response looks like this:

```json
[
  [
    1499040000000,      // Open time
    "0.01634790",       // Open
    "0.80000000",       // High
    "0.01575800",       // Low
    "0.01577100",       // Close
    "148976.11427815",  // Volume
    1499644799999,      // Close time
    "2434.19055334",    // Quote asset volume
    308,                // Number of trades
    "1756.87402397",    // Taker buy base asset volume
    "28.46694368",      // Taker buy quote asset volume
    "17928899.62484339" // Ignore.
  ]
]
```

In [3]:
data <- fromJSON(url) # get json and transform it to list()
data <- data[, 1:7] # keep only columns 1:7 (from Open time to Close time)
head(data)

0,1,2,3,4,5,6
1678334400000.0,21770.37,21774.69,21720.11,21740.43,5953.71527,1678337999999
1678338000000.0,21740.39,21761.33,21728.0,21738.93,5857.40431,1678341599999
1678341600000.0,21738.93,21746.04,21683.49,21741.8,7890.8559,1678345199999
1678345200000.0,21741.25,21742.39,21665.67,21688.54,8160.18877,1678348799999
1678348800000.0,21688.54,21698.38,21602.0,21681.52,15335.28307,1678352399999
1678352400000.0,21682.31,21700.0,21585.0,21641.58,12338.10576,1678355999999


In [4]:
typeof(data) # check data type
data <- as.data.frame(data) # convert to dataframe
head(data)

Unnamed: 0_level_0,V1,V2,V3,V4,V5,V6,V7
Unnamed: 0_level_1,<chr>,<chr>,<chr>,<chr>,<chr>,<chr>,<chr>
1,1678334400000.0,21770.37,21774.69,21720.11,21740.43,5953.71527,1678337999999
2,1678338000000.0,21740.39,21761.33,21728.0,21738.93,5857.40431,1678341599999
3,1678341600000.0,21738.93,21746.04,21683.49,21741.8,7890.8559,1678345199999
4,1678345200000.0,21741.25,21742.39,21665.67,21688.54,8160.18877,1678348799999
5,1678348800000.0,21688.54,21698.38,21602.0,21681.52,15335.28307,1678352399999
6,1678352400000.0,21682.31,21700.0,21585.0,21641.58,12338.10576,1678355999999


In [5]:
# fix columns names
colnames(data) <- c("Open_time", "Open", "High", "Low", "Close", "Volume", "Close_time")
head(data) # looks better, but the columns are still character

Unnamed: 0_level_0,Open_time,Open,High,Low,Close,Volume,Close_time
Unnamed: 0_level_1,<chr>,<chr>,<chr>,<chr>,<chr>,<chr>,<chr>
1,1678334400000.0,21770.37,21774.69,21720.11,21740.43,5953.71527,1678337999999
2,1678338000000.0,21740.39,21761.33,21728.0,21738.93,5857.40431,1678341599999
3,1678341600000.0,21738.93,21746.04,21683.49,21741.8,7890.8559,1678345199999
4,1678345200000.0,21741.25,21742.39,21665.67,21688.54,8160.18877,1678348799999
5,1678348800000.0,21688.54,21698.38,21602.0,21681.52,15335.28307,1678352399999
6,1678352400000.0,21682.31,21700.0,21585.0,21641.58,12338.10576,1678355999999


In [6]:
is.numeric(data[,1]) # check 1st column type is numeric
is.numeric(data[,2]) # check 2nd column type is numeric

In [7]:
data <- as.data.frame(sapply(data, as.numeric)) # convert all columns to numeric
head(data) # good, the columns are double now

Unnamed: 0_level_0,Open_time,Open,High,Low,Close,Volume,Close_time
Unnamed: 0_level_1,<dbl>,<dbl>,<dbl>,<dbl>,<dbl>,<dbl>,<dbl>
1,1678334000000.0,21770.37,21774.69,21720.11,21740.43,5953.715,1678338000000.0
2,1678338000000.0,21740.39,21761.33,21728.0,21738.93,5857.404,1678342000000.0
3,1678342000000.0,21738.93,21746.04,21683.49,21741.8,7890.856,1678345000000.0
4,1678345000000.0,21741.25,21742.39,21665.67,21688.54,8160.189,1678349000000.0
5,1678349000000.0,21688.54,21698.38,21602.0,21681.52,15335.283,1678352000000.0
6,1678352000000.0,21682.31,21700.0,21585.0,21641.58,12338.106,1678356000000.0


The final step is to convert `Open_time` and `Close_time` to dates.

In [8]:
data$Open_time <- as.POSIXct(data$Open_time/1e3, origin = '1970-01-01')
data$Close_time <- as.POSIXct(data$Close_time/1e3, origin = '1970-01-01')

head(data) 

Unnamed: 0_level_0,Open_time,Open,High,Low,Close,Volume,Close_time
Unnamed: 0_level_1,<dttm>,<dbl>,<dbl>,<dbl>,<dbl>,<dbl>,<dttm>
1,2023-03-09 06:00:00,21770.37,21774.69,21720.11,21740.43,5953.715,2023-03-09 06:59:59
2,2023-03-09 07:00:00,21740.39,21761.33,21728.0,21738.93,5857.404,2023-03-09 07:59:59
3,2023-03-09 08:00:00,21738.93,21746.04,21683.49,21741.8,7890.856,2023-03-09 08:59:59
4,2023-03-09 09:00:00,21741.25,21742.39,21665.67,21688.54,8160.189,2023-03-09 09:59:59
5,2023-03-09 10:00:00,21688.54,21698.38,21602.0,21681.52,15335.283,2023-03-09 10:59:59
6,2023-03-09 11:00:00,21682.31,21700.0,21585.0,21641.58,12338.106,2023-03-09 11:59:59
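
`as.POSIXct()` prints times in the session's local time zone unless told otherwise, so the hours shown above depend on the computer where the code runs. A minimal sketch with an explicit `tz = "UTC"`, using the first `Open_time` value from the earlier output:

```r
ms <- 1678334400000  # first Open time from the earlier output, in milliseconds

# tz = "UTC" makes the result independent of the computer's local time zone
open_utc <- as.POSIXct(ms / 1e3, origin = "1970-01-01", tz = "UTC")
format(open_utc, "%Y-%m-%d %H:%M:%S")  # "2023-03-09 04:00:00"
```

Fixing the time zone keeps results comparable when the notebook is run on different machines.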


In [9]:
tail(data) # check last records

Unnamed: 0_level_0,Open_time,Open,High,Low,Close,Volume,Close_time
Unnamed: 0_level_1,<dttm>,<dbl>,<dbl>,<dbl>,<dbl>,<dbl>,<dttm>
95,2023-03-13 04:00:00,22295.4,22356.8,22066.4,22182.01,23006.55,2023-03-13 04:59:59
96,2023-03-13 05:00:00,22182.65,22282.4,22114.39,22231.62,15593.28,2023-03-13 05:59:59
97,2023-03-13 06:00:00,22232.76,22459.62,22222.88,22320.07,18986.77,2023-03-13 06:59:59
98,2023-03-13 07:00:00,22320.07,22402.29,22285.3,22397.11,13248.24,2023-03-13 07:59:59
99,2023-03-13 08:00:00,22397.11,22506.0,22245.0,22381.9,16172.57,2023-03-13 08:59:59
100,2023-03-13 09:00:00,22381.11,22495.0,22319.06,22483.91,16428.85,2023-03-13 09:59:59


# 5. Google Services

## 5.1. Spreadsheets

> THIS CHAPTER IS UNDER CONSTRUCTION / Working with Google Spreadsheets requires account authorization.

`googlesheets4` is a package for working with Google Sheets from R. 

In [None]:
#install.packages("googlesheets4")
library(googlesheets4)

You can read Google documents after authentication on the Google service. Here is sample code:

```
read_sheet("https://docs.google.com/spreadsheets/d/1U6Cf_qEOhiR9AZqTqS3mbMF3zt2db48ZP5v3rkrAEJY/edit#gid=780868077")
gs4_deauth()
```

Let's read the sample dataset `gapminder`. It is described in detail in the next paragraph.

In [None]:
gs4_example("gapminder")

## 5.2. Trends

**`Google Trends`** is a service for analyzing search requests with many filters, such as `region` (continent, country, locality), `period` (year, month), `information category` (business, education, hobby, healthcare), and `information type` (news, shopping, video, images): https://trends.google.com/trends/

In [None]:
install.packages('gtrendsR')
install.packages('ggplot2')
library(gtrendsR) # loading package for Google Trends queries
library(ggplot2)

Let's configure the Google Trends query parameters:

In [None]:
keywords = c("Bitcoin", "FC Barcelona") # search keywords
country = c('AT') # search region from https://support.google.com/business/answer/6270107?hl=en
time = ("2021-01-01 2021-06-01") # period
channel = 'web' # search channel: google search ('news' - google news, 'images' - google images)

In [None]:
# query
trends = gtrends(keywords, gprop = channel, geo = country, time = time, tz = "UTC")

In [None]:
time_trend = trends$interest_over_time
head(time_trend)

In [None]:
plot <- ggplot(data=time_trend, aes(x=date, y=hits, group=keyword, col=keyword)) +
  geom_line() +
  xlab('Time') + 
  ylab('Relative Interest') + 
  theme(legend.title = element_blank(), legend.position="bottom", legend.text=element_text(size=15)) + 
  ggtitle("Google Search Volume")  

plot

---

# 6. SQL (SQLite sample)

We are going to review working with a database using SQLite, because it does not require installing a DB server and lets us work with a simple file. 

For now we will use the `RSQLite` package.

In [None]:
install.packages("RSQLite")
library(RSQLite)

In [None]:
# let's use mtcars dataset

data("mtcars") # loads the data
head(mtcars) # preview the data

In [None]:
# path to the new db file
db_path = "data/cars_.sqlite"
# create connection (the file is created if it does not exist)
conn <- dbConnect(RSQLite::SQLite(), db_path)

In [None]:
# Write the mtcars dataset into a table named cars_table
dbWriteTable(conn, "cars_table", mtcars,
             overwrite = TRUE, append = FALSE) # for lecture content only
# List all the tables available in the database
dbListTables(conn)

In [None]:
table_data <- dbGetQuery(conn, "SELECT * FROM cars_table")
head(table_data)

In [None]:
# close connection
dbDisconnect(conn)

You can write complex queries over many tables, as far as your knowledge of SQL allows.
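
As a small step beyond `SELECT *`, here is a sketch with an aggregate query and a parameterised query. It uses a temporary in-memory database (`":memory:"`) instead of the file above, so nothing is written to disk.

```r
library(RSQLite)

conn <- dbConnect(RSQLite::SQLite(), ":memory:")  # temporary in-memory database
dbWriteTable(conn, "cars_table", mtcars)

# aggregate: number of cars and average mpg per cylinder count
by_cyl <- dbGetQuery(conn, "
  SELECT cyl, COUNT(*) AS n_cars, AVG(mpg) AS avg_mpg
  FROM cars_table
  GROUP BY cyl
  ORDER BY cyl")
by_cyl

# parameterised query: the value is passed separately from the SQL text
six_cyl <- dbGetQuery(conn, "SELECT * FROM cars_table WHERE cyl = ?",
                      params = list(6))
nrow(six_cyl)

dbDisconnect(conn)
```

Passing values through `params` instead of pasting them into the query string avoids quoting mistakes.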

# 7. Web-pages (HTML)

Sometimes decision making requires scraping data from web sources and pages.

Let's try to parse data from `Wikipedia` as a table. 



In [None]:
install.packages("rvest")
library(rvest) # Parsing of HTML/XML files

Go to web page https://en.wikipedia.org/wiki/List_of_largest_banks and check it.

In [None]:
# page URL
url <- "https://en.wikipedia.org/wiki/List_of_largest_banks"
# local copy of the page, it overrides the URL above
url <- "data/List of largest banks - Wikipedia_.html"

In [None]:
# read html content of the page
page <- read_html(url)
page

In [None]:
# read all tables on the page
tables <- html_nodes(page, "table")
tables

For now, let's read the table of total assets in US$ billion:

In [None]:
# with pipe operator
#tables[2] %>% 
 #   html_table(fill = TRUE) %>% 
 #   as.data.frame()
#without pipe operator
assets_table <- as.data.frame(html_table(tables[2], fill = TRUE))   
head(assets_table)
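
Scraped tables often hold numbers as text, for example with thousands separators. The sketch below parses a tiny inline HTML table (bank names and figures are made up) and converts such a column to numeric; `assets_num` is a name chosen here for illustration.

```r
library(rvest)

# tiny inline page; bank names and figures are made up for illustration
page <- read_html("
  <table>
    <tr><th>Bank name</th><th>Total assets</th></tr>
    <tr><td>Bank A</td><td>1,500.5</td></tr>
    <tr><td>Bank B</td><td>980.2</td></tr>
  </table>")

tables <- html_nodes(page, "table")
banks <- as.data.frame(html_table(tables[[1]]))  # [[1]] picks a single table

# strip the thousands separators from the 2nd column, then convert to numbers
banks$assets_num <- as.numeric(gsub(",", "", banks[[2]]))
banks
```

Using `[[1]]` returns one table directly, while `[1]` keeps a list of tables and needs the `as.data.frame()` step to unwrap it.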

Next, read the market capitalization table (the 4th one):

In [None]:
capital_table <- as.data.frame(html_table(tables[4], fill = TRUE))   
head(capital_table)

And now let's `merge()` these two datasets:

In [None]:
merged_data <- merge(assets_table, capital_table, by = "Bank.name")
head(merged_data)
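
By default `merge()` keeps only rows whose key appears in both tables. The sketch below shows the difference between this default and `all = TRUE` on two toy tables (names and figures are made up).

```r
# toy tables (names and figures are made up for illustration)
assets  <- data.frame(Bank.name = c("Bank A", "Bank B", "Bank C"),
                      assets    = c(1500, 980, 720))
capital <- data.frame(Bank.name = c("Bank A", "Bank C", "Bank D"),
                      capital   = c(210, 95, 60))

inner <- merge(assets, capital, by = "Bank.name")              # only banks in both
outer <- merge(assets, capital, by = "Bank.name", all = TRUE)  # all banks, NA for gaps

inner
outer
```

If the merged result has fewer rows than expected, comparing it with the `all = TRUE` version shows which keys did not match.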

### Task 7.1

From the page https://en.wikipedia.org/wiki/List_of_largest_banks read the following tables and `merge by country`:

- [x] Number of banks in the top 100 by total assets
- [x] Total market capital (US$ billion) across the top 70 banks by country

### Solution

In [None]:
library(rvest)
url <- "https://en.wikipedia.org/wiki/List_of_largest_banks" # open this url in another tab
url <- "data/List of largest banks - Wikipedia_.html" # local copy of the page, it overrides the URL above
page_data <- read_html(url) # read html content

tables <- html_nodes(page_data, "table")
html_table(tables[1]) # this is not the table we need

In [None]:
html_table(tables[3]) # this is the table "Number of banks in the top 100 by total assets"
# check the end of the table: there is an NA record
# let's remove it

In [None]:
table1 <- as.data.frame(html_table(tables[3]))
table1 <- table1[!is.na(table1$Country), ]
table1 # now it is OK!

In [None]:
# Solution for "Total market capital (US$ billion) across the top 70 banks by country"
# compare this with the table on the given page
table2 <- as.data.frame(html_table(tables[5]))
table2

---

## References

1. [SQLite in R. Datacamp](https://www.datacamp.com/community/tutorials/sqlite-in-r)
2. [Tidyverse googlesheets4 0.2.0](https://www.tidyverse.org/blog/2020/05/googlesheets4-0-2-0/)
3. [Telecom users dataset. Practice classification with a telco dataset. Kaggle](https://www.kaggle.com/radmirzosimov/telecom-users-dataset)
4. [Binance Spot API Docs](https://github.com/binance/binance-spot-api-docs/blob/master/rest-api.md#klinecandlestick-data)
5. [Web Scraping in R: rvest Tutorial](https://www.datacamp.com/community/tutorials/r-web-scraping-rvest) by Arvid Kingl