# 02807: Project 2. Downloading and sampling the data
 

# Zipped files
The following are direct download links to zipped versions of large datasets.

- [listings.zip](https://files.dtu.dk/fss/public/link/public/stream/read/listings.zip?linkToken=02-co91vnKzBb5yr&itemName=listings.zip) (1.2 GB)
- [reviews.zip](https://files.dtu.dk/fss/public/link/public/stream/read/reviews.zip?linkToken=0DNVjN4zVY3CM1WE&itemName=reviews.zip) (3.7 GB)

# Downloading the files

The following commands download each zipped file to the `/content/` folder.

In [None]:
!wget -O listings.zip "https://files.dtu.dk/fss/public/link/public/stream/read/listings.zip?linkToken=4Ba7a-4Wu2vyFrtj&itemName=listings.zip"

--2020-11-11 17:46:43--  https://files.dtu.dk/fss/public/link/public/stream/read/listings.zip?linkToken=4Ba7a-4Wu2vyFrtj&itemName=listings.zip
Resolving files.dtu.dk (files.dtu.dk)... 192.38.84.17
Connecting to files.dtu.dk (files.dtu.dk)|192.38.84.17|:443... connected.
HTTP request sent, awaiting response... 200 OK
Length: unspecified [application/octet-stream]
Saving to: ‘listings.zip’

listings.zip            [                <=> ]   1.16G  2.50MB/s    in 8m 48s  

2020-11-11 17:55:31 (2.26 MB/s) - ‘listings.zip’ saved [1248342905]



In [None]:
!wget -O reviews.zip "https://files.dtu.dk/fss/public/link/public/stream/read/reviews.zip?linkToken=0DNVjN4zVY3CM1WE&itemName=reviews.zip"

--2020-11-11 18:56:48--  https://files.dtu.dk/fss/public/link/public/stream/read/reviews.zip?linkToken=0DNVjN4zVY3CM1WE&itemName=reviews.zip
Resolving files.dtu.dk (files.dtu.dk)... 192.38.84.17
Connecting to files.dtu.dk (files.dtu.dk)|192.38.84.17|:443... connected.
HTTP request sent, awaiting response... 200 OK
Length: unspecified [application/octet-stream]
Saving to: ‘reviews.zip’

reviews.zip             [            <=>     ]   3.68G  2.53MB/s    in 25m 9s  

2020-11-11 19:21:57 (2.50 MB/s) - ‘reviews.zip’ saved [3954950472]
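
If `wget` is not available, the same download can be sketched in plain Python. This is a minimal sketch using only the standard library; the helper name `download` and its `chunk_size` parameter are illustrative assumptions, not part of the original notebook.

```python
import shutil
import urllib.request


def download(url, dest_path, chunk_size=1024 * 1024):
    """Stream `url` to `dest_path` and return the number of bytes written."""
    with urllib.request.urlopen(url) as response, open(dest_path, "wb") as out:
        # copyfileobj reads in chunks, so a multi-gigabyte archive
        # is never held in memory all at once
        shutil.copyfileobj(response, out, chunk_size)
        return out.tell()
```

Calling `download(url, "/content/listings.zip")` with one of the links above should behave like the `wget -O` cells. The returned byte count can be compared with the size `wget` reports in its last output line (for example `1248342905` for `listings.zip` in the run shown).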



# Unzipping the files

In [None]:
!unzip /content/listings.zip

Archive:  /content/listings.zip
replace listings.csv? [y]es, [n]o, [A]ll, [N]one, [r]ename: yes
  inflating: listings.csv            
  inflating: __MACOSX/._listings.csv  


In [None]:
!unzip /content/reviews.zip

Archive:  /content/reviews.zip
  inflating: reviews.csv             
  inflating: __MACOSX/._reviews.csv  
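
The output above shows two side effects of `unzip`: an interactive overwrite prompt when `listings.csv` already exists, and extra `__MACOSX/` metadata files. A possible alternative, sketched here with Python's standard `zipfile` module, overwrites silently and skips the metadata entries. The helper name `extract_csvs` is a hypothetical one introduced for illustration.

```python
import zipfile
from pathlib import Path


def extract_csvs(zip_path, dest_dir="."):
    """Extract archive members into dest_dir, skipping macOS metadata entries."""
    extracted = []
    with zipfile.ZipFile(zip_path) as archive:
        for name in archive.namelist():
            # Skip directory entries and the __MACOSX/ resource-fork files
            if name.endswith("/") or name.startswith("__MACOSX/"):
                continue
            # ZipFile.extract overwrites an existing file without prompting
            archive.extract(name, dest_dir)
            extracted.append(str(Path(dest_dir) / name))
    return extracted
```

For example, `extract_csvs("/content/listings.zip", "/content")` should leave only `listings.csv` in `/content/`.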


# Creating the dataframes

In [None]:
!pip install pyspark
!pip install -U -q PyDrive
!apt install openjdk-8-jdk-headless -qq
import os
os.environ["JAVA_HOME"] = "/usr/lib/jvm/java-8-openjdk-amd64"

openjdk-8-jdk-headless is already the newest version (8u272-b10-0ubuntu1~18.04).
0 upgraded, 0 newly installed, 0 to remove and 12 not upgraded.


In [None]:
# Let's import the libraries we will need
import pyspark
from pyspark.sql import *
from pyspark.sql import functions as f
from pyspark.sql.types import *
from pyspark import SparkContext, SparkConf

In [None]:
# create the Spark session
spark = SparkSession.builder.getOrCreate()

In [None]:
spark

In [None]:
listings_raw = (spark.read.option('header', True)
                         .option('inferSchema', True)
                         .option('multiLine', True)
                         .option('escape', '"').csv('/content/listings.csv'))

In [None]:
listings_raw.show()


In [None]:
listings_raw.count()

1330480

In [None]:
reviews_raw = (spark.read.option('header', True)
                         .option('inferSchema', True)
                         .option('multiLine', True)
                         .option('escape', '"').csv('/content/reviews.csv'))

In [None]:
reviews_raw.show()

+----------+---------+----------+-----------+-------------+--------------------+
|listing_id|       id|      date|reviewer_id|reviewer_name|            comments|
+----------+---------+----------+-----------+-------------+--------------------+
|    145320|156423122|2017-05-30|  123386382|        Erwin|Prima plek om Sto...|
|    145320|170211906|2017-07-15|  123091743|         Anne|Cosy and clean fl...|
|    145320|172169175|2017-07-20|      78004|     Patricia|The host canceled...|
|    145320|176647581|2017-07-31|  103178743|    Charlotte|Kim's place was o...|
|    145320|185676021|2017-08-22|    4023961|    Alexander|great spacious ap...|
|    145320|189668224|2017-09-02|  142869362|        Heiko|Kim is a very fri...|
|    145320|191894030|2017-09-09|   25194419|        Jason|The apartment is ...|
|    145320|193316070|2017-09-13|   52056015|        David|Nicely appointed,...|
|    145320|196760607|2017-09-24|    3980456|        Janne|It was a pleasure...|
|    145320|201885633|2017-1

# Creating a sample of 100k rows for `listings`

In [None]:
listings_100k = listings_raw.limit(100000)

In [None]:
listings_100k.show()


In [None]:
listings_100k.count()

100000
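
Note that `limit(100000)` returns the first 100,000 rows Spark reads rather than a random draw, so the subset may not be representative if the file is ordered in some way. If a random sample is preferred, one option is `DataFrame.sample`, which takes a fraction instead of a row count. The helper below is a hedged sketch; the name `random_sample` and the default seed are assumptions made for illustration.

```python
def random_sample(df, total_rows, target_rows, seed=42):
    """Return a random sample of a Spark DataFrame with roughly target_rows rows."""
    if total_rows <= 0:
        raise ValueError("total_rows must be positive")
    # DataFrame.sample expects a fraction in [0, 1], not a row count
    fraction = min(1.0, target_rows / total_rows)
    # Sampling without replacement; a fixed seed makes the draw repeatable
    return df.sample(withReplacement=False, fraction=fraction, seed=seed)
```

With the row count reported earlier, `random_sample(listings_raw, 1330480, 100000)` would use a fraction of about 0.075. The size of the result is only approximately 100,000, because each row is kept or dropped independently.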

In [None]:
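# Write the sample as a single part file.
# Spark creates a directory named "listings_100k.csv" that holds the part file.
(listings_100k
   .repartition(1)
   .write.format("csv")  # built-in CSV writer; "com.databricks.spark.csv" is the legacy name
   .option("header", True)
   .option('multiLine', True)
   .option('escape', '"')
   .save("listings_100k.csv"))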
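
The write above produces a directory called `listings_100k.csv` that contains one `part-*.csv` file (because of `repartition(1)`) alongside marker files such as `_SUCCESS`. To end up with an ordinary single CSV file, the part file can be copied out. The following is a minimal sketch with the standard library; `collect_single_csv` is a hypothetical helper name.

```python
import glob
import os
import shutil


def collect_single_csv(output_dir, dest_path):
    """Copy the single part-*.csv file from a Spark output directory to dest_path."""
    parts = glob.glob(os.path.join(output_dir, "part-*.csv"))
    if len(parts) != 1:
        # More than one part file means the DataFrame was not repartitioned to 1
        raise ValueError(f"expected exactly one part file, found {len(parts)}")
    shutil.copyfile(parts[0], dest_path)
    return dest_path
```

For example, `collect_single_csv("listings_100k.csv", "listings_100k_flat.csv")` would copy the sample into a plain file; the destination name here is only an illustration.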

You can adapt the code above to create a sample for the `reviews` dataset. You may also want to subset the data in different ways: choosing certain columns, filtering, etc.
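
As one way of adapting the approach to `reviews`, the sketch below combines a row filter, a column selection, and a row limit. The helper name `subset_reviews` and the example filter are assumptions for illustration; the column names are those shown in the `reviews_raw.show()` output.

```python
def subset_reviews(df, columns, condition=None, n_rows=None):
    """Filter rows, keep the chosen columns, and optionally cap the row count."""
    subset = df
    if condition is not None:
        # DataFrame.filter accepts a SQL expression string; filtering first
        # lets the condition use columns that are not selected afterwards
        subset = subset.filter(condition)
    subset = subset.select(*columns)
    if n_rows is not None:
        subset = subset.limit(n_rows)
    return subset
```

A call such as `subset_reviews(reviews_raw, ["listing_id", "date", "comments"], "date >= '2017-09-01'", 100000)` would keep three columns of reviews from September 2017 onward, capped at 100,000 rows. The cutoff date is arbitrary, and the result can be written out the same way as `listings_100k`.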

