---
author: "Юрій Клебан"
---

# CSV

---

You need these packages to run the code:

In [1]:
# install.packages("readr")

In [79]:
invisible(Sys.setlocale("LC_ALL", "Ukrainian"))
invisible(options(warn=-1))

---

## What is CSV (Comma Separated Values)?


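`CSV` stands for comma-separated values: a plain-text format in which each line is a record and the fields within a record are separated by a delimiter, usually a comma.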
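
As a minimal sketch of the format, the snippet below parses a small CSV text directly from a string using the `text` argument of `read.csv()`. The column names and values are made up for illustration and are not part of the telecom dataset.

```r
# A tiny CSV document: the first line is the header, each following line is a record
csv_text <- "id,name,score
1,Ann,9.5
2,Bob,7.0"

# read.csv() can parse a character string through the `text` argument
toy <- read.csv(text = csv_text)

str(toy)   # two observations of three variables
```

Reading from a string like this is handy for quick experiments, because no file on disk is required.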

In [80]:
# check the current working directory so that file paths can be written correctly
getwd()

In R, you can write paths with `/` or with an escaped backslash `\\`. For example:

In [81]:
path = "d:/projects/file.csv"
path = "d:\\projects\\file.csv"

To build a path from its parts, use the `paste()` or `paste0()` functions:

In [82]:
work_dir = getwd()
work_dir 

In [83]:
file_name = "temp_file.csv"
file_path = paste0(work_dir, "/", file_name)
file_path

In [84]:
file_path = paste(work_dir, file_name, sep = "/")
file_path
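
Base R also provides `file.path()`, which joins path components with `/` for you. The sketch below reuses the same hypothetical file name as above and only builds the path string; it does not create or read any file.

```r
work_dir <- getwd()
file_name <- "temp_file.csv"

# file.path() inserts the separator between components
file_path <- file.path(work_dir, file_name)
file_path

# The result matches the paste() version with sep = "/"
identical(file_path, paste(work_dir, file_name, sep = "/"))

# basename() and dirname() split the path back into its parts
basename(file_path)
dirname(file_path)
```

Using `file.path()` avoids forgetting the separator when a path is assembled from several pieces.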

## Sample dataset description

The dataset comes from [kaggle.com](https://www.kaggle.com).
The original file is located at: [https://www.kaggle.com/radmirzosimov/telecom-users-dataset](https://www.kaggle.com/radmirzosimov/telecom-users-dataset).

Any business wants to maximize its number of customers. To achieve this, it is important not only to attract new customers but also to retain existing ones. Retaining a client costs the company less than attracting a new one. In addition, a new client may have little interest in the company's services and be difficult to work with, while for existing clients the company already has data on their interaction with the service.

Accordingly, by predicting churn, we can react in time and try to keep a client who wants to leave. Based on data about the services the client uses, we can make a special offer to try to change their decision to leave the operator. This makes retention easier to implement than attracting new users, about whom we know nothing yet.

You are provided with a dataset from a telecommunications company. The data contains information about almost six thousand users, their demographic characteristics, the services they use, the duration of using the operator's services, the method of payment, and the amount of payment.

The task is to analyze the data and predict the churn of users (to identify people who will and will not renew their contract). The work should include the following mandatory items:

1. Description of the data (with the calculation of basic statistics);
2. Research of dependencies and formulation of hypotheses;
3. Building models for predicting the outflow (with justification for the choice of a particular model), based on tested hypotheses and identified relationships;
4. Comparison of the quality of the obtained models.

**Fields description:**

- [x] `customerID` - customer id
- [x] `gender` - client gender (male / female)
- [x] `SeniorCitizen` - is the client retired (1, 0)
- [x] `Partner` - is the client married (Yes, No)
- [x] `tenure` - how many months a person has been a client of the company
- [x] `PhoneService` - is the telephone service connected (Yes, No)
- [x] `MultipleLines` - are multiple phone lines connected (Yes, No, No phone service)
- [x] `InternetService` - client's Internet service provider (DSL, Fiber optic, No)
- [x] `OnlineSecurity` - is the online security service connected (Yes, No, No internet service)
- [x] `OnlineBackup` - is the online backup service activated (Yes, No, No internet service)
- [x] `DeviceProtection` - does the client have equipment insurance (Yes, No, No internet service)
- [x] `TechSupport` - is the technical support service connected (Yes, No, No internet service)
- [x] `StreamingTV` - is the streaming TV service connected (Yes, No, No internet service)
- [x] `StreamingMovies` - is the streaming cinema service activated (Yes, No, No internet service)
- [x] `Contract` - type of customer contract (Month-to-month, One year, Two year)
- [x] `PaperlessBilling` - whether the client uses paperless billing (Yes, No)
- [x] `PaymentMethod` - payment method (Electronic check, Mailed check, Bank transfer (automatic), Credit card (automatic))
- [x] `MonthlyCharges` - current monthly payment
- [x] `TotalCharges` - the total amount that the client paid for the services for the entire time
- [x] `Churn` - whether there was a churn (Yes or No)


## Reading

There are a few functions for reading and writing CSV in base R (they live in the `utils` package, which is loaded by default):

- [x] `read.csv()`, `write.csv()` - the default field separator is `,`, the decimal separator is `.`.
- [x] `read.csv2()`, `write.csv2()` - the default field separator is `;`, the decimal separator is `,`.
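
The difference between the two variants can be sketched on a small semicolon-separated text. The product names and prices are invented for illustration.

```r
# Semicolon as field separator, comma as decimal separator
csv2_text <- "product;price
apple;1,5
pear;2,25"

# read.csv2() uses sep = ";" and dec = "," by default
prices <- read.csv2(text = csv2_text)
str(prices)
prices$price   # 1.50 2.25

# The same result with read.csv() and explicit arguments
prices_explicit <- read.csv(text = csv2_text, sep = ";", dec = ",")
isTRUE(all.equal(prices, prices_explicit))
```

If a file exported from a spreadsheet reads in as a single column, a mismatched separator is a likely cause and `read.csv2()` is worth trying.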

Before using any new function, check its usage information with `help(function_name)` or `?function_name`, for example: **`?read.csv`**.

You can read the file as follows (this copy of the dataset has NA values added as an example; there are no NA values in the original dataset):

In [85]:
data <- read.csv("../../data/telecom_users.csv") # default reading
str(data)

'data.frame':	5986 obs. of  22 variables:
 $ X               : int  1869 4528 6344 6739 432 2215 5260 6001 1480 5137 ...
 $ customerID      : chr  "7010-BRBUU" "9688-YGXVR" "9286-DOJGF" "6994-KERXL" ...
 $ gender          : chr  "Male" "Female" "Female" "Male" ...
 $ SeniorCitizen   : int  0 0 1 0 0 0 0 0 0 1 ...
 $ Partner         : chr  "Yes" "No" "Yes" "No" ...
 $ Dependents      : chr  "Yes" "No" "No" "No" ...
 $ tenure          : int  72 44 38 4 2 70 33 1 39 55 ...
 $ PhoneService    : chr  "Yes" "Yes" "Yes" "Yes" ...
 $ MultipleLines   : chr  "Yes" "No" "Yes" "No" ...
 $ InternetService : chr  "No" "Fiber optic" "Fiber optic" "DSL" ...
 $ OnlineSecurity  : chr  "No internet service" "No" "No" "No" ...
 $ OnlineBackup    : chr  "No internet service" "Yes" "No" "No" ...
 $ DeviceProtection: chr  "No internet service" "Yes" "No" "No" ...
 $ TechSupport     : chr  "No internet service" "No" "No" "No" ...
 $ StreamingTV     : chr  "No internet service" "Yes" "No" "No" ...
 $ Streaming

In [86]:
data <- read.csv("../../data/telecom_users.csv",
                  sep = ",", # the comma is not the only possible separator
                  dec = ".", # decimal separator can be different
                  na.strings = c("", "NA", "NULL")) # you can define NA values
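
To illustrate what `na.strings` changes, here is a small sketch on made-up data in which missing charges are encoded three different ways (empty field, `NA`, and `NULL`). The column names are hypothetical.

```r
csv_text <- "id,charge
1,10.5
2,NULL
3,
4,NA"

# By default only the string "NA" is treated as missing,
# so the text "NULL" prevents the column from being read as numeric
default_read <- read.csv(text = csv_text)
is.numeric(default_read$charge)   # FALSE

# Declaring all three encodings as missing gives a numeric column
custom_read <- read.csv(text = csv_text, na.strings = c("", "NA", "NULL"))
is.numeric(custom_read$charge)    # TRUE
sum(is.na(custom_read$charge))    # 3
```

Which strings count as missing depends on how the file was produced, so the vector passed to `na.strings` usually has to be chosen after looking at the raw data.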

In [87]:
str(data) # check data structure / types / values

'data.frame':	5986 obs. of  22 variables:
 $ X               : int  1869 4528 6344 6739 432 2215 5260 6001 1480 5137 ...
 $ customerID      : chr  "7010-BRBUU" "9688-YGXVR" "9286-DOJGF" "6994-KERXL" ...
 $ gender          : chr  "Male" "Female" "Female" "Male" ...
 $ SeniorCitizen   : int  0 0 1 0 0 0 0 0 0 1 ...
 $ Partner         : chr  "Yes" "No" "Yes" "No" ...
 $ Dependents      : chr  "Yes" "No" "No" "No" ...
 $ tenure          : int  72 44 38 4 2 70 33 1 39 55 ...
 $ PhoneService    : chr  "Yes" "Yes" "Yes" "Yes" ...
 $ MultipleLines   : chr  "Yes" "No" "Yes" "No" ...
 $ InternetService : chr  "No" "Fiber optic" "Fiber optic" "DSL" ...
 $ OnlineSecurity  : chr  "No internet service" "No" "No" "No" ...
 $ OnlineBackup    : chr  "No internet service" "Yes" "No" "No" ...
 $ DeviceProtection: chr  "No internet service" "Yes" "No" "No" ...
 $ TechSupport     : chr  "No internet service" "No" "No" "No" ...
 $ StreamingTV     : chr  "No internet service" "Yes" "No" "No" ...
 $ Streaming

In [88]:
head(data, 2) # head() returns the first 6 rows by default; use n = X to view the first X rows (here 2)

| | X | customerID | gender | SeniorCitizen | Partner | Dependents | tenure | PhoneService | MultipleLines | InternetService | ... | DeviceProtection | TechSupport | StreamingTV | StreamingMovies | Contract | PaperlessBilling | PaymentMethod | MonthlyCharges | TotalCharges | Churn |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| | `<int>` | `<chr>` | `<chr>` | `<int>` | `<chr>` | `<chr>` | `<int>` | `<chr>` | `<chr>` | `<chr>` | ... | `<chr>` | `<chr>` | `<chr>` | `<chr>` | `<chr>` | `<chr>` | `<chr>` | `<dbl>` | `<dbl>` | `<chr>` |
| 1 | 1869 | 7010-BRBUU | Male | 0 | Yes | Yes | 72 | Yes | Yes | No | ... | No internet service | No internet service | No internet service | No internet service | Two year | No | Credit card (automatic) | 24.1 | 1734.65 | No |
| 2 | 4528 | 9688-YGXVR | Female | 0 | No | No | 44 | Yes | No | Fiber optic | ... | Yes | No | Yes | No | Month-to-month | Yes | Credit card (automatic) | 88.15 | 3973.2 | No |


In [89]:
is.data.frame(data) # check whether data is a data.frame

In [90]:
anyNA(data) # check whether the data frame contains any NA values

In [91]:
lapply(data, anyNA) # check for NA values column by column (the 2nd dimension)

In the output, note `MonthlyCharges: TRUE` and `TotalCharges: TRUE`: these columns contain NA values.

Let's replace them with the column mean:

In [92]:
data[is.na(data$TotalCharges), "TotalCharges"] <- mean(data$TotalCharges, na.rm = TRUE)
data[is.na(data$MonthlyCharges), "MonthlyCharges"] <- mean(data$MonthlyCharges, na.rm = TRUE)

In [93]:
any(is.na(data)) # check for NA
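
The same replacement can be written once for every numeric column instead of repeating a line per column. The sketch below uses a small made-up data frame (hypothetical columns `monthly` and `total`) rather than the telecom data, so the result is easy to verify by hand.

```r
toy <- data.frame(
  id      = c("a", "b", "c", "d"),
  monthly = c(20, NA, 40, 60),
  total   = c(100, 200, NA, 300)
)

# Find the numeric columns and fill their NA values with the column mean
numeric_cols <- names(toy)[sapply(toy, is.numeric)]
for (col in numeric_cols) {
  missing <- is.na(toy[[col]])
  toy[missing, col] <- mean(toy[[col]], na.rm = TRUE)
}

toy$monthly   # 20 40 40 60
toy$total     # 100 200 200 300
anyNA(toy)    # FALSE
```

Mean imputation is only one simple option; whether it is appropriate depends on the data and on the analysis that follows.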

You can write data with `write.csv()` or `write.csv2()`, which are available in base R (the `utils` package).

In [2]:
write.csv(data, file = "../../data/cleaned_data.csv", row.names = FALSE)
# by default row.names = TRUE, and the file will contain a first column with row numbers 1, 2, ..., N

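Note: if this cell is run in a fresh session, before `data` has been read, the name `data` refers to R's built-in `data()` function and the call fails with:

ERROR: Error in as.data.frame.default(x[[i]], optional = TRUE): cannot coerce class '"function"' to a data.frame

Run the reading cells first.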
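
The effect of `row.names` can be sketched with temporary files and a made-up data frame. It also suggests where a column named `X`, like the one in the `str()` output earlier, typically comes from: a file written with row names and read back again. That this is the origin of `X` in the telecom file is an assumption.

```r
toy <- data.frame(id = c(10, 20), name = c("Ann", "Bob"))

with_names    <- tempfile(fileext = ".csv")
without_names <- tempfile(fileext = ".csv")

write.csv(toy, file = with_names)                        # row.names = TRUE by default
write.csv(toy, file = without_names, row.names = FALSE)

# The unnamed row-name column is read back as "X"
names(read.csv(with_names))      # "X" "id" "name"
names(read.csv(without_names))   # "id" "name"

unlink(c(with_names, without_names))   # remove the temporary files
```

Passing `row.names = FALSE` keeps the written file free of that extra column.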


## `readr` package

Another useful package is `readr`. Usage examples (uncomment them after installing the package):

In [95]:
# library(readr)
# data <- read_csv(file = "../../data/telecom_users.csv")
# data <- read_csv2(file = "../../data/telecom_users.csv")
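
A minimal runnable sketch with `readr`, assuming the package is installed: it writes a small made-up data frame to a temporary file and reads it back with `read_csv()`. The guard with `requireNamespace()` simply skips the example when `readr` is not available.

```r
if (requireNamespace("readr", quietly = TRUE)) {
  tmp <- tempfile(fileext = ".csv")
  write.csv(data.frame(id = 1:2, charge = c(10.5, NA)), tmp, row.names = FALSE)

  # read_csv() prints a column specification message; suppress it here
  toy <- suppressMessages(readr::read_csv(tmp))

  print(class(toy))   # a tibble, which is also a data.frame
  print(toy)

  unlink(tmp)         # remove the temporary file
}
```

Unlike `read.csv()`, `read_csv()` returns a tibble, so printing shows the column types alongside the first rows.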

---

## Datasets

1. https://github.com/kleban/r-book-published/tree/main/datasets/telecom_users.csv
2. https://github.com/kleban/r-book-published/tree/main/datasets/telecom_sers.xlsx
3. https://github.com/kleban/r-book-published/tree/main/datasets/Default_Fin.csv
4. https://github.com/kleban/r-book-published/tree/main/datasets/employes.xml
