**Vanilla R GUI**
![R interface](rinterface.png)

R: A Calculator ..
====


In [None]:
10+5

In [None]:
20*6+12*6+12/4


In [None]:
12 %% 4 # '%%'  Modulus operator 

In [None]:
log10(100)

In [None]:
log(20)

In [None]:
exp(2.995732)


In [None]:
round(exp(2.995732))

R: Interactive Data Analysis - Vector Arithmetic ..
========

* In data analysis, single-valued objects (like a=1) have little relevance. Here, a variable is something that varies from case to case.
* R's data type for this is the vector: a collection of values of a single type
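
As a small illustration of the "single type" rule, the sketch below uses made-up values (the names `nums` and `mixed` are only for this example) to show how `c()` coerces mixed inputs to one common type, and how arithmetic applies to every element at once.

```r
# Made-up values, purely for illustration
nums  <- c(1, 2.5, 4)        # all numeric
mixed <- c(1, "a", TRUE)     # mixed input: everything is coerced to character

class(nums)    # "numeric"
class(mixed)   # "character"
mixed          # "1" "a" "TRUE"

# Arithmetic is applied element by element
nums * 2       # 2 5 8
```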


In [None]:
height=c(113,66,66,69.5,70,70.5,71,72,72.25,72.5,72.75,67,116) 
height

In [None]:
mean(height)

In [None]:
sd(height)

In [None]:
meansdori=c(mean=mean(height),sd=sd(height))
meansdori 

* Let us assume that we made an error in the measurements, and that to correct it we need to add 5 to each height value.
* R's vector arithmetic comes in handy

In [None]:
height=height+5
height

In [None]:
meansdadd=c(mean=mean(height),sd=sd(height))
meansdadd 

* So, adding a constant to every member of a data vector shifts the average by that constant but has no impact on the SD

* Now, let us change the unit of measurement from inches to centimeters: multiply each element of 'height' by 2.54

In [None]:
height=height-5      # undo the earlier +5 correction
height=height*2.54   # convert inches to centimeters
height

In [None]:
meansdmult=c(mean=mean(height),sd=sd(height))
meansdmult

* Notice the change in the values of both mean and SD compared with 'meansdori'. Multiplying by a constant scales both the mean and the SD by that constant
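
The two observations above (adding a constant versus multiplying by one) can be checked on any numeric vector. The following is a minimal sketch using hypothetical measurements, not the `height` data; `x`, `shifted` and `scaled` are names introduced only for this example.

```r
# Hypothetical measurements in inches (not the data above)
x <- c(60, 62, 65, 70, 73)

shifted <- x + 5       # add a constant
scaled  <- x * 2.54    # multiply by a constant

# Adding a constant moves the mean by that constant; the SD is unchanged
all.equal(mean(shifted), mean(x) + 5)    # TRUE
all.equal(sd(shifted), sd(x))            # TRUE

# Multiplying by a constant multiplies both the mean and the SD by it
all.equal(mean(scaled), mean(x) * 2.54)  # TRUE
all.equal(sd(scaled), sd(x) * 2.54)      # TRUE
```

`all.equal()` is used instead of `==` because floating-point results may differ in the last digits. For a negative constant, the SD is multiplied by its absolute value.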

R: Data analysis with online data..
===
* World Bank Data (http://data.worldbank.org/)
 * http://data.worldbank.org/indicator/ST.INT.RCPT.CD
* R users don't need to visit the World Bank site and download the data by hand
* Use the package WDI, which helps us search for and extract World Development Indicators data directly from R

In [None]:
# install.packages('WDI')   # run once if the package is not yet installed
library('WDI')

* One can search for the required data using the command WDIsearch. For the data on tourism receipts proceed as follows:

In [None]:
WDIsearch('tourism')

* Let us select "ST.INT.RCPT.CD", the code for the "International tourism, receipts (current US$)" data
* Now, we will download the data using the function WDI

In [None]:
start_date=1995
end_date=2013
dat2 = WDI(indicator="ST.INT.RCPT.CD", start=start_date, end=end_date, extra=TRUE)
str(dat2)
no_of_years=end_date-start_date+1
no_of_years

* The data set 'dat2' contains tourism receipts for the period 1995 (start date) to 2013 (end date) across several countries
* Let us list out a few observations (with selected variables)

In [None]:
dat2[1:19,c("country","ST.INT.RCPT.CD","region","year")]


* The data set contains not just country data alone; some of the rows are region aggregates. So, let us eliminate all such rows.

In [None]:
dat2=subset(dat2,region!="Aggregates")
dim(dat2)

* Now, let us pick only relevant variables and drop the rest

In [None]:
dat2=subset(dat2,select=c("country","ST.INT.RCPT.CD","year"))
dim(dat2)

* How many countries do we have data for?

In [None]:
length(unique(dat2$country))
length(unique(dat2$country))*no_of_years

In [None]:

n_countries=length(unique(dat2$country))
cat('Now we have', no_of_years, 'years of data for', n_countries, 'countries (total rows =', paste0(n_countries*no_of_years, ')'), '\n')

Now let us clean the data

* Some cases have no data (NA values) and we need to eliminate them

* Let us eliminate all cases with no data using the function 'complete.cases()'. This function returns a logical vector indicating which cases are complete
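
To see what that logical vector looks like, here is a minimal sketch on a toy data frame. The data frame `toy` and its values are made up for illustration and are not part of the World Bank data.

```r
# Toy data frame (values are made up)
toy <- data.frame(
  country = c("A", "A", "B", "B"),
  value   = c(10, NA, 30, 40),
  year    = c(2000, 2001, 2000, 2001),
  stringsAsFactors = FALSE
)

keep <- complete.cases(toy)   # TRUE where a row has no NA in any column
keep                          # TRUE FALSE TRUE TRUE

toy_clean <- toy[keep, ]      # keep only the complete rows
nrow(toy_clean)               # 3
```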


In [None]:
dat2=dat2[complete.cases(dat2),]
dim(dat2)

* In our data set, most countries now have data for all 19 years (1995-2013), but a few still have incomplete data
* Let us find the countries with incomplete data
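
The counting-and-filtering idea used for this step can be tried on a toy data frame first. In this sketch the data, the name `toy` and the expected count `n_years` are all hypothetical: rows are counted per country with `tapply()`, and any country with fewer rows than expected is dropped.

```r
# Toy data: country "B" is missing one of the three expected years
toy <- data.frame(
  country = c("A", "A", "A", "B", "B", "C", "C", "C"),
  value   = c(1, 2, 3, 4, 5, 6, 7, 8),
  stringsAsFactors = FALSE
)
n_years <- 3   # hypothetical number of years expected per country

counts <- tapply(toy$value, toy$country, length)   # rows per country
counts                                             # A: 3, B: 2, C: 3

incomplete <- names(counts[counts < n_years])      # "B"
toy_full <- subset(toy, !(country %in% incomplete))
unique(toy_full$country)                           # "A" "C"
```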


In [None]:
cases1=tapply(dat2$ST.INT.RCPT.CD,dat2$country,length)
# The vector 'cases1' holds the number of data points available for each country

In [None]:
cases1[1:5]
dim(cases1)

In [None]:
# Now, we just need to filter out all the countries with fewer than 19 (no_of_years) data points
eliminate1=names(cases1[cases1<no_of_years])
# The vector 'eliminate1' contains the names of all countries with incomplete data.
# We use it to remove those countries from the data set
dat2=subset(dat2,!(country %in% eliminate1))
dim(dat2)

In [None]:
length(unique(dat2$country))

In [None]:
n_countries=length(unique(dat2$country))
cat('Now we have', no_of_years, 'years of data for', n_countries, 'countries (total rows =', paste0(n_countries*no_of_years, ')'), '\n')

In [None]:
# Now we have 19 years of complete data for 155 countries
# Before moving ahead, let us rescale the receipts data to millions of dollars by dividing it by 1000000
dat2$ST.INT.RCPT.CD=dat2$ST.INT.RCPT.CD/1000000

In [None]:
# Now let us create a time series graph for the five BRICS countries; here we use the package ggplot2

library(ggplot2)


In [None]:
brics=c("Brazil", "Russian Federation", "India", "China", "South Africa")
ggplot(subset(dat2, country %in% brics), aes(year, ST.INT.RCPT.CD, color=country)) +
  geom_line() +
  xlab('Year') +
  ylab('Tourist Receipts (million US$)') +
  ggtitle("Tourist Receipts for BRICS countries: 1995-2013")

R: Data analysis with online data: Some insights
===

In [None]:
# Let us find the average receipts of each country over the 19 years
avg1=tapply(dat2$ST.INT.RCPT.CD,dat2$country,mean)
avg1[1:10] # Average for the first 10 countries


In [None]:
# * Average for India, China and Pakistan
avg1[c('India','China','Pakistan')]


In [None]:
# Let us compute the median, minimum, maximum, range and standard deviation for each country
med1=tapply(dat2$ST.INT.RCPT.CD,dat2$country,median)
min1=tapply(dat2$ST.INT.RCPT.CD,dat2$country,min)
max1=tapply(dat2$ST.INT.RCPT.CD,dat2$country,max)
range1=max1-min1
sd1=tapply(dat2$ST.INT.RCPT.CD,dat2$country,sd)

In [None]:
# We can look up any of these statistics for any country by name: the median value for Brazil is shown below
med1['Brazil']

In [None]:
# Maximum value for Brazil, Russia, India, China and South Africa
max1[c("Brazil", "Russian Federation", "India", "China","South Africa")]

In [None]:
# To list the 10 leading countries in international tourism by average receipts, use the following command
sort(avg1,decreasing=TRUE)[1:10]
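
The per-country summaries and the ranking above follow one pattern: `tapply()` returns a named vector with one entry per country, which can then be indexed by name or sorted. A minimal sketch on made-up data (the names `toy`, `avg` and `rng` are introduced only for this example):

```r
# Toy data (made-up receipts, in millions)
toy <- data.frame(
  country  = c("A", "A", "B", "B", "C", "C"),
  receipts = c(10, 20, 100, 300, 5, 7),
  stringsAsFactors = FALSE
)

avg <- tapply(toy$receipts, toy$country, mean)   # A: 15, B: 200, C: 6
rng <- tapply(toy$receipts, toy$country, function(v) max(v) - min(v))

# Names of the two countries with the highest average
names(sort(avg, decreasing = TRUE))[1:2]         # "B" "A"
rng["B"]                                         # B: 200
```

When fewer entries exist than are requested, indexing with `[1:10]` pads the result with NA; `head(sort(avg, decreasing = TRUE), 10)` avoids that.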