We'll begin by loading the dplyr library and importing a CSV file listing the cities worldwide that are home to the most Starbucks outlets.  The original data is from 2014, and I found it over at [Future Learn](https://www.futurelearn.com/courses/maths-power-laws/0/steps/12163):

In [2]:
library(dplyr)

In [3]:
starbucks <- read.csv('starbucks.csv', stringsAsFactors = FALSE)

We can use the **head()** function to view the first six rows of the data frame:

In [4]:
head(starbucks)

ï..city,abb,continent,population,outlets
Beijing,BJ,Asia,20035455,137
Calgary,CGY,North America,1547484,121
Chicago,CHI,North America,2694236,164
Dallas,DAL,North America,1382267,84
Edmonton,EDM,North America,1461182,72
Hong Kong,HK,Asia,7547652,75


That "ï.." in the first column name looks a little ugly.  There are a few ways we could remove it, but renaming the column ought to be simple enough:

In [5]:
colnames(starbucks)[which(names(starbucks) == "ï..city")] <- "city"
head(starbucks)

city,abb,continent,population,outlets
Beijing,BJ,Asia,20035455,137
Calgary,CGY,North America,1547484,121
Chicago,CHI,North America,2694236,164
Dallas,DAL,North America,1382267,84
Edmonton,EDM,North America,1461182,72
Hong Kong,HK,Asia,7547652,75
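
A stray "ï.." like this is typically what a UTF-8 byte-order mark at the start of a file turns into when **read.csv()** reads it under a different encoding and tidies the header into a valid column name. As an alternative to renaming, here is a minimal sketch, assuming that is the cause here, which passes `fileEncoding = "UTF-8-BOM"` so the mark is dropped on import. The temporary file and its two rows are made up for illustration and only stand in for starbucks.csv:

```r
# Build a small CSV that starts with a UTF-8 byte-order mark (EF BB BF).
# The file and its contents are illustrative, not the real starbucks.csv.
tmp <- tempfile(fileext = ".csv")
bom <- as.raw(c(0xEF, 0xBB, 0xBF))
writeBin(c(bom, charToRaw("city,outlets\nBeijing,137\nCalgary,121\n")), tmp)

# Telling read.csv() about the BOM keeps it out of the first column name
demo <- read.csv(tmp, fileEncoding = "UTF-8-BOM", stringsAsFactors = FALSE)
names(demo)
```

If this applies to the real file, the rename step becomes unnecessary, although the rename works regardless of where the prefix came from.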


The **class()** function confirms that we're working with a data frame, and by checking the structure with **str()**, we can see that we're working with 25 rows of 5 variables: 

In [6]:
class(starbucks)

In [7]:
str(starbucks)

'data.frame':	25 obs. of  5 variables:
 $ city      : chr  "Beijing" "Calgary" "Chicago" "Dallas" ...
 $ abb       : chr  "BJ" "CGY" "CHI" "DAL" ...
 $ continent : chr  "Asia" "North America" "North America" "North America" ...
 $ population: int  20035455 1547484 2694236 1382267 1461182 7547652 2340888 15190336 662000 9304016 ...
 $ outlets   : int  137 121 164 84 72 75 135 106 136 202 ...


Uncommenting any of the lines below would allow us to view all elements within individual variables:

In [8]:
#starbucks$city
#starbucks$abb
#starbucks$continent
#starbucks$population
#starbucks$outlets

And it might be helpful to know what classes we're working with:

In [9]:
class(starbucks$city)
class(starbucks$abb)
class(starbucks$continent)
class(starbucks$population)
class(starbucks$outlets)

By using **sort()** on the column containing number of retail outlets, we can get an idea of the range of values in our data.  We can also use the **mean()** function on this variable to see the average number of outlets per city in our data set: 

In [10]:
sort(starbucks$outlets)
round(mean(starbucks$outlets))

The **order()** function allows us to see a list of cities, ordered from the fewest outlets to the most:

In [11]:
index <- order(starbucks$outlets)
starbucks$city[index]
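
To see what **order()** hands back, here is a small sketch using only the six cities shown by head() earlier rather than the full data set. It returns positions rather than values, which is why the result can be used to index a different column:

```r
# First six cities and outlet counts, copied from the head() output
city    <- c("Beijing", "Calgary", "Chicago", "Dallas", "Edmonton", "Hong Kong")
outlets <- c(137, 121, 164, 84, 72, 75)

sort(outlets)          # the values themselves, smallest first
idx <- order(outlets)  # the positions that would sort the values
idx
city[idx]              # the same positions applied to another vector
```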

**max()** and **min()**, followed by indexing with **which.max()** and **which.min()**, show us that Seoul (284) has the most outlets, and San Jose (71) the fewest:

In [12]:
max(starbucks$outlets)
i_max <- which.max(starbucks$outlets)
starbucks$city[i_max]
min(starbucks$outlets)
i_min <- which.min(starbucks$outlets)
starbucks$city[i_min]
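
One thing worth knowing about **which.max()** and **which.min()** is that they return only the first matching position when several values tie. A minimal sketch with made-up outlet counts (not taken from the Starbucks data) shows the difference from comparing against **max()** directly:

```r
# Hypothetical outlet counts with a tie for the top spot
outlets <- c(90, 120, 75, 120)

first_max <- which.max(outlets)              # first position of the maximum only
all_max   <- which(outlets == max(outlets))  # every position holding the maximum

first_max
all_max
```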

So that I can fully enjoy my visits, I'd appreciate it if the Starbucks outlets weren't too crowded.  We can get a per capita figure by dividing each city's number of outlets by its population and multiplying by 100,000, which gives outlets per 100,000 residents.  We'll also round this to 2 significant figures:

In [13]:
per_100k <- signif(starbucks$outlets / starbucks$population*100000, 2)
per_100k
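
As a quick check on the arithmetic, here is the calculation for Beijing alone, using the population and outlet count from the table shown earlier. It also sketches why **signif()** is used rather than **round()**: signif() keeps two significant figures whatever the size of the number, while round() keeps a fixed number of decimal places. The value 21.37 is made up purely to show the contrast:

```r
# Beijing's figures, taken from the data shown earlier
outlets    <- 137
population <- 20035455

rate <- outlets / population * 100000  # outlets per 100,000 residents
signif(rate, 2)   # two significant figures: 0.68

# With a larger (made-up) value the two functions behave differently
signif(21.37, 2)  # 21
round(21.37, 2)   # 21.37
```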

And we can now order the cities by the per capita figure, from high to low:

In [14]:
starbucks$city[order(per_100k, decreasing = TRUE)]

We'll add this new variable to our Starbucks data frame with the **mutate()** function:

In [15]:
starbucks <- mutate(starbucks, per_100k = signif(outlets / population * 100000, 2))
head(starbucks)

city,abb,continent,population,outlets,per_100k
Beijing,BJ,Asia,20035455,137,0.68
Calgary,CGY,North America,1547484,121,7.8
Chicago,CHI,North America,2694236,164,6.1
Dallas,DAL,North America,1382267,84,6.1
Edmonton,EDM,North America,1461182,72,4.9
Hong Kong,HK,Asia,7547652,75,0.99


Since I live in London, I'd like to go somewhere with even more Starbucks outlets per capita than my home city.  We can extract this number from the data frame by using **which()** to find London's index number...

In [16]:
index <- which(starbucks$city == "London")
index

...and then applying it to the per_100k variable in the data frame:

In [17]:
starbucks$per_100k[index]

I think I'd like my next vacation to be in North America, and it'd be great if the Starbucks per capita number was relatively high: at least double London's figure of 2.2, so 4.4.  By saving these two conditions as logical vectors, combining them with **&**, and indexing the city column with the result, we can build a list of candidate vacation cities:

In [18]:
manystars <- starbucks$per_100k >= 4.4
northam <- starbucks$continent == "North America"
index <- manystars & northam
starbucks$city[index]
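
The sketch below runs the same two-condition filter on just the six cities shown by head() earlier, so the intermediate logical vectors are short enough to read. It is an illustration on a sample, not the full result; the 4.4 threshold is the one from the text:

```r
# Six rows copied from the head() output, not the full data set
city      <- c("Beijing", "Calgary", "Chicago", "Dallas", "Edmonton", "Hong Kong")
continent <- c("Asia", "North America", "North America",
               "North America", "North America", "Asia")
per_100k  <- c(0.68, 7.8, 6.1, 6.1, 4.9, 0.99)

manystars <- per_100k >= 4.4               # TRUE where the rate is high enough
northam   <- continent == "North America"  # TRUE for North American cities
keep      <- manystars & northam           # TRUE only where both conditions hold

keep
city[keep]
```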

I've never been to Canada, and wonder what the Starbucks per capita rates are like for Calgary and Edmonton.  We can use the **match()** function to extract these figures:

In [19]:
index <- match(c("Calgary", "Edmonton"), starbucks$city)
starbucks$per_100k[index]

I'd quite like to visit New Orleans, Memphis, Las Vegas, or Atlanta.  We can use **%in%** to see whether any of these cities are in the Starbucks data frame:

In [20]:
c("New Orleans", "Memphis", "Las Vegas", "Atlanta") %in% starbucks$city
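
**match()** and **%in%** are closely related: match() reports where each value was found (or NA if it wasn't), while %in% only reports whether it was found. A small sketch, using the six cities from head() as a stand-in for the full city column, so a FALSE here says nothing about the complete data:

```r
# Six city names copied from the head() output, not the full column
city   <- c("Beijing", "Calgary", "Chicago", "Dallas", "Edmonton", "Hong Kong")
wanted <- c("Calgary", "Atlanta", "Edmonton")

positions <- match(wanted, city)  # positions in city, NA where there is no match
present   <- wanted %in% city     # TRUE/FALSE for each wanted value

positions
present
```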

Out of curiosity, I wonder which cities have a lower per 100k rate than London.  We can use **filter()** on our data frame to get this info:

In [21]:
filter(starbucks, per_100k < 2.2)

city,abb,continent,population,outlets,per_100k
Beijing,BJ,Asia,20035455,137,0.68
Hong Kong,HK,Asia,7547652,75,0.99
Istanbul,IST,Europe,15190336,106,0.7
Mexico City,CDMX,North America,21671908,160,0.74
Shanghai,SH,Asia,27058479,256,0.95


For my final list of candidate cities, I decide that I want an even higher per capita rate, this time doubled again to 8.8 outlets per 100,000 people.  We can create a final condensed data frame, consisting of city name, abbreviation, and Starbucks per capita figure, by combining **select()** and **filter()**:

In [22]:
starbucks2 <- select(starbucks, city, abb, per_100k)
filter(starbucks2, per_100k >= 8.8)

city,abb,per_100k
Las Vegas,LV,21.0
Portland,PO,15.0
San Diego,SD,8.8
San Francisco,SF,9.2
Seattle,SEA,18.0
Washington DC,DC,11.0


We could also have used the pipe operator (**%>%**) to produce the same table:

In [23]:
starbucks3 <- starbucks %>% select(city, abb, per_100k) %>% filter(per_100k >= 8.8)
starbucks3

city,abb,per_100k
Las Vegas,LV,21.0
Portland,PO,15.0
San Diego,SD,8.8
San Francisco,SF,9.2
Seattle,SEA,18.0
Washington DC,DC,11.0
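
To make the equivalence concrete, here is a minimal sketch comparing nested calls with the piped form. It uses only the six rows shown by head() earlier, and a threshold of 4.4 instead of 8.8 because none of those six rows reaches 8.8. The names `sample_df`, `nested`, and `piped` are just for illustration:

```r
library(dplyr)

# Six rows copied from the head() output earlier, not the full data set
sample_df <- data.frame(
  city     = c("Beijing", "Calgary", "Chicago", "Dallas", "Edmonton", "Hong Kong"),
  abb      = c("BJ", "CGY", "CHI", "DAL", "EDM", "HK"),
  outlets  = c(137, 121, 164, 84, 72, 75),
  per_100k = c(0.68, 7.8, 6.1, 6.1, 4.9, 0.99),
  stringsAsFactors = FALSE
)

# Nested calls read inside-out: select() runs first, then filter()
nested <- filter(select(sample_df, city, abb, per_100k), per_100k >= 4.4)

# The pipe passes the left-hand side as the first argument of the next call
piped <- sample_df %>%
  select(city, abb, per_100k) %>%
  filter(per_100k >= 4.4)

identical(nested, piped)
```

The piped version does the same work but reads top to bottom in the order the steps happen, which tends to matter more as the chain grows.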


Looks like my next vacation is in one of these six cities!

In [None]:
population_in_millions <- starbucks$population/10^6
total_starbucks <- starbucks$outlets
plot(population_in_millions, total_starbucks)

In [None]:
hist(starbucks$per_100k)

In [None]:
boxplot(per_100k~continent, data = starbucks)