Multi-Node Cluster Database with Cassandra on Linux

Part 1: Setting up Cassandra on the Linux VM's

A: Setup on Individual Nodes

Make sure JAVA_HOME path is added in the "~/.bashrc" file
Type sudo vim ~/.bashrc and add the path in the end. The path is usually in the /usr/lib/jvm/ folder.

Check the Path by typing echo $JAVA_HOME
Make sure curl is installed, else install it using apt-get
To install Cassandra on your node, type the following commands, or Follow these instructions: http://cassandra.apache.org/download/:

echo "deb https://downloads.apache.org/cassandra/debian 311x main" | sudo tee -a /etc/apt/sources.list.d/cassandra.sources.list

Add the Apache Cassandra repository keys:

curl https://downloads.apache.org/cassandra/KEYS | sudo apt-key add -

Update the repositories:

sudo apt-get update

Install Cassandra:

sudo apt-get install cassandra
Start Cassandra with sudo service cassandra start and check the status using nodetool status and sudo service cassandra status commands:

B: Configure Cassandra nodes to make them work together

Stop Cassandra on all three nodes:
sudo service cassandra stop
Delete the default dataset:
sudo rm -rf /var/lib/cassandra/data/system/*
Edit the configuration file cassandra.yaml on all 3 nodes.It is located in the /etc/cassandra directory:

a. The directive cluster_name should be same across all the nodes:
cluster_name: 'Test Cluster'

b. Update -seeds: by adding all 3 IP addresses, separated by commas:
seeds: "206.81.9.191,67.205.176.120,206.81.9.184"

c. Update listen_address and rpc_address on each node to the IP address to that node.
For "206.81.9.191", it should look like:
listen_address: 206.81.9.191
rpc_address: 206.81.9.191

d. Update endpoint_switch from SimpleSnitch to GossipingPropertyFileSnitch:
endpoint_snitch: GossipingPropertyFileSnitch

e. Finally, add the following directive at the end of the file:
auto_bootstrap: false

Save the cassandra.yaml file all on 3 nodes
Start Cassandra on all 3 nodes again:
sudo service cassandra start
Check the status of the clusters with the sudo nodetool status command. All 3 nodes should be listed:

Part 2: Importing Data into Cassandra:

Make sure Python is installed on the VM, and then run CQL with the cqlsh <nodeIPAdress> command.
In the CQL Terminal, create a keyspace:
CREATE KEYSPACE cloud_log WITH replication = {'class': 'SimpleStrategy','replication_factor':'1'} AND durable_writes = 'true';

Use that keyspace with the use cloud_log ; command.
Create a table in the cloud_log keyspace with all the Common Log Format columns :

CREATE TABLE access_log (IP_Address text, client_id text, client_username text, time text,timezone text, request_line text, status_code text, object_size text, PRIMARY KEY(IP_Address,time,request_line));

Here, we are using the combination of IP_Address, timestamp and request line as the primary key.
Import the access_log data into your table using the copy command:

COPY access_log (IP_Address , client_id , client_username , time , timezone , request_line , status_code, object_size) FROM '/home/student/access_log' WITH DELIMITER=' ';

Check the data with a simple select query:
select * from access_log where IP_Address = '10.207.188.188' limit 5;

Create additional table `Request_line_details` to operate on `path` in the `request_line` column:

In order to run CQL Queries on path, we need to make another table with path as the main primary key.
CREATE TABLE Request_line_details (IP_Address text, time text,method text, path text, protocol text, PRIMARY KEY(path,IP_Address,time,method,protocol));
To load data into this table, we first need to split the request_line column. In order to do so, we first export the data into a new csv file:
COPY cloud_log.access_log (ip_address,time,request_line) TO 'requestline.csv' with DELIMITER = ' ' and QUOTE = ' ';
Next, we copy data from requestline.csv to the request_line_details table:
COPY cloud_log.request_line_details (IP_ADDRESS,time,method, path, protocol) FROM 'requestline.csv' WITH DELIMITER = ' ' and QUOTE = '"' and ESCAPE = '*';

Check the data with a simple select query: select * from request_line_details limit 5;

Part 3: Operate Data in Cassandra:

How many hits were made to the website item “/assets/img/release-schedule-logo.png”?
Ans: 24282

select count(*) from request_line_details where path = '/assets/img/release-schedule-logo.png\';
How many hits were made from the IP: 10.207.188.188?
Ans: 398

select count(*) from cloud_log.access_log where IP_Address = '10.207.188.188' ;

For Q3 and Q4, we will use python scripts and Cassandra-driver:

First run pip install cassandra-driver

Which path in the website has been hit most? How many hits were made to the path?
Ans: Path: /assets/css/combined.css has been hit 115101 times

Run the script: python Cassandra_most_path.py
Which IP accesses the website most? How many accesses were made by it?
Ans: IP: 10.216.113.172 has been accessed 158298 times

Run the script: python Cassandra_most_ip.py

Name		Name	Last commit message	Last commit date
Latest commit History 46 Commits
CQL Queries		CQL Queries
Gallery		Gallery
Python Scripts		Python Scripts
README.md		README.md

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Repository files navigation

Multi-Node Cluster Database with Cassandra on Linux

Part 1: Setting up Cassandra on the Linux VM's

A: Setup on Individual Nodes

B: Configure Cassandra nodes to make them work together

Part 2: Importing Data into Cassandra:

Create additional table `Request_line_details` to operate on `path` in the `request_line` column:

Part 3: Operate Data in Cassandra:

For Q3 and Q4, we will use python scripts and Cassandra-driver:

About

Releases

Packages

Languages

nishchalnigam/2750-Cassandra

Folders and files

Latest commit

History

Repository files navigation

Multi-Node Cluster Database with Cassandra on Linux

Part 1: Setting up Cassandra on the Linux VM's

A: Setup on Individual Nodes

B: Configure Cassandra nodes to make them work together

Part 2: Importing Data into Cassandra:

Create additional table Request_line_details to operate on path in the request_line column:

Part 3: Operate Data in Cassandra:

For Q3 and Q4, we will use python scripts and Cassandra-driver:

About

Topics

Resources

Stars

Watchers

Forks

Releases

Packages 0

Languages

Create additional table `Request_line_details` to operate on `path` in the `request_line` column:

Packages