# Monitoring infra setup:

### [ `Prerequisites` ]

#### [ Within VDI ]

> download <mark>Miniconda3-py37_4.8.2-Linux-x86_64.sh</mark> from https://repo.anaconda.com/miniconda/

> download & install WinSCP

> upload Miniconda3-py37_4.8.2-Linux-x86_64.sh to <mark>Telegraf host</mark> to /tmp or /home/${USER}/UI/install_files (but first create the folder for that - see below)


#### [ Telegraf host CLI ]

`update bashrc for the monitoring user:`

In [None]:
#run once:

# the quoted heredoc ('EOF') keeps $PATH, ${USER} and the inner double quotes literal
cat >> /home/${USER}/.bashrc <<'EOF'
export HISTTIMEFORMAT="[%Y-%m-%d %H:%M:%S] "
HISTSIZE='INFINITY'; HISTFILESIZE='ANDBEYOND'

PS1='\e[37m\D{%H:%M}\e[91m[\e[90m\u@\h \e[33m\w\e[31m]\e[92m\n\$'

alias ll='ls -alF'
alias la='ls -A'
alias l='ls -CmF'
alias lr='ls -ltrh'
ufind() { find / -name "$1" 2>/dev/null; }  # a function, because aliases cannot take arguments

export PATH=$PATH:/home/${USER}/scripts
EOF
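
Since the cell above is meant to run only once, a guard that makes the append safe to repeat may be useful. The following is a minimal sketch, not part of the original setup: the function name `append_once` and the marker text are hypothetical.

```shell
#!/bin/bash
# append_once (hypothetical helper): append a block of text to a file
# only if the marker line is not already present in that file.
append_once() {
    local file="$1" marker="$2" block="$3"
    # -F: fixed string, -x: match the whole line, -q: quiet
    if grep -Fxq -- "$marker" "$file" 2>/dev/null; then
        return 0    # already applied, nothing to do
    fi
    printf '%s\n%s\n' "$marker" "$block" >> "$file"
}

# Example call (assumed marker text; adjust the path as needed):
# append_once "/home/${USER}/.bashrc" "# >>> monitoring setup >>>" "alias lr='ls -ltrh'"
```

Calling the helper a second time with the same marker leaves the file unchanged.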

In [None]:
# Check OS version:
cat /etc/os-release

(below commands are for RHEL distro)
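
To decide in a script whether the RHEL commands apply, the `ID` and `ID_LIKE` fields of `/etc/os-release` can be inspected. This is a sketch under assumptions: the function name `is_rhel_like` is hypothetical, and it only recognises systems that declare `rhel` in one of those two fields.

```shell
#!/bin/bash
# is_rhel_like (hypothetical helper): succeed if the os-release file
# describes RHEL or a distribution that declares itself RHEL-compatible.
is_rhel_like() {
    local file="${1:-/etc/os-release}"
    [ -r "$file" ] || return 1
    (
        # os-release consists of shell-compatible VAR=value lines
        . "$file"
        case " ${ID:-} ${ID_LIKE:-} " in
            *" rhel "*) exit 0 ;;
            *)          exit 1 ;;
        esac
    )
}

if is_rhel_like; then
    echo "RHEL-like system: the yum commands below should apply"
else
    echo "not RHEL-like: package names and the package manager may differ"
fi
```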

In [None]:
# install prereqs with elevated (sudo) user:

# prod:
yum install -y alsa-lib bc gcc gcc-c++ kernel-devel libXScrnSaver libXcomposite libXcursor libXdamage libXi libXrandr libXtst libffi-devel libxslt-devel mesa-libEGL mesa-libGL msodbcsql18.x86_64 openssl-devel unixODBC-devel
# dev:
subscription-manager repos --enable=codeready-builder-for-rhel-8-x86_64-rpms # required for x11
yum install -y alsa-lib bc gcc gcc-c++ kernel-devel libXScrnSaver libXcomposite libXcursor libXdamage libXi libXrandr libXtst libffi-devel libxslt-devel mesa-libEGL mesa-libGL msodbcsql18.x86_64 openssl-devel unixODBC-devel xorg-x11-apps xorg-x11-xauth firefox

In [None]:
# create folders:
mkdir -p  /home/${USER}/UI/Flask ~/UI/install_files

# give execute rights to the installer, then run it:
chmod +x Miniconda3-py37_4.8.2-Linux-x86_64.sh
./Miniconda3-py37_4.8.2-Linux-x86_64.sh
#     Long press Enter; then "yes"; then "yes" (again)
#     when completed - reload shell:
. ~/.bashrc

# install requirements for Flask API (maybe you need to update the path to requirements folder: check "flask_wapi_UAT")
cd ~/UI/flask_wapi_UAT/requirements; for file in *; do pip install "./${file}"; done
#    check if all are required (wheel; tar.gz.. some might be duplicates)
#    install the remaining (not included - since I've changed from Anaconda to miniconda) packages via proxy command:
for package in Flask python-dotenv pandas pyodbc; do pip install $package --proxy "http://USER:XXXX@XXX.XXX.XXX.XX:XXXX"; done

# open the firewall port:
firewall-cmd --add-port=8000/tcp
firewall-cmd --add-port=8000/tcp --permanent

# load the flask config and start the UI:
. ../.flaskenv
. ../.flask run
#    [dev/test instance]
flask run --host 0.0.0.0  
#    [Prod]
IP="$(hostname -I | awk '{print $1}')"
nohup flask run --host $IP &
# to close the Flask UI:
#    - type "fg" and press ctrl+c; or
#    - "kill %1" (but be sure that it is the only background process that is running!); or
#    - 'pkill flask'
#    to close all related tasks: 'sudo killall -u ${USER}'
#    note that they will still be started again from cron or when the server is restarted
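
Because `kill %1` and `pkill flask` can hit the wrong process, recording the PID at start time is one alternative. The sketch below is only an illustration: `start_bg`, `stop_bg` and the PID file path are hypothetical names, not part of the original setup.

```shell
#!/bin/bash
# start_bg (hypothetical helper): run a command in the background and
# store its PID in the given file.
start_bg() {   # usage: start_bg <pidfile> <command> [args...]
    local pidfile="$1"; shift
    nohup "$@" >/dev/null 2>&1 &
    echo $! > "$pidfile"
}

# stop_bg (hypothetical helper): stop exactly the process recorded in the PID file.
stop_bg() {    # usage: stop_bg <pidfile>
    local pidfile="$1"
    [ -f "$pidfile" ] || return 1
    kill "$(cat "$pidfile")" 2>/dev/null
    rm -f "$pidfile"
}

# Example (not executed here; the PID file path is an assumption):
# start_bg /tmp/flask_ui.pid flask run --host "$IP"
# stop_bg  /tmp/flask_ui.pid
```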

[VDI] `UI`
> open any preferred web browser (e.g. Brave/Edge/Firefox/Chrome/Opera...)

> open the UI: http://<mark>XX.X.XX.XX</mark>:8000 #replace with correct IP address <mark># note that the firewall port 8000 has to be opened!</mark> (as stated above)
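
If the page does not load, a quick reachability check of port 8000 from a Linux shell can help tell a firewall problem from a Flask problem. A minimal sketch, assuming bash with `/dev/tcp` support and the coreutils `timeout` command; `port_open` is a hypothetical helper name.

```shell
#!/bin/bash
# port_open (hypothetical helper): succeed if a TCP connection to host:port
# can be opened within 3 seconds, using bash's /dev/tcp redirection.
port_open() {
    local host="$1" port="$2"
    timeout 3 bash -c "exec 3<>/dev/tcp/${host}/${port}" 2>/dev/null
}

# Example (placeholder address; replace with the Telegraf host IP):
# port_open XX.X.XX.XX 8000 && echo "UI port reachable" || echo "port 8000 closed or filtered"
```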

[Telegraf host] `cli backend scripts and scheduled cronjobs`

##### CHANGELOG // current version = v1.09; 2022.10.19 (Author: Michal Márkus)
 for changes prior to 9.02 please check the meeting invite "Infra Self Monitoring (agenda)"
- 9.02: Added harvest check 
- 9.06: Added uptime reporting (not included in this script; the new script is called "insert_uptime.sh" and is scheduled via cron to run daily, 5 minutes after midnight)
- 9.13: solution migrated to PROD & configuration adjusted; added influx-bucket monitoring
- 10.04: assets (grafana/influx/harvest/telegraf) added to Radix Shared cockpit (needed for maintenance & uptime reporting)
- 10.11: logrotation for the backend script logs added (for more details look for '/etc/logrotate.d/flask' below); NodeRed is now also monitored; ticket creation validated on PROD - is working; maintenance info added to the UI as well as 'past incidents' tab got enhanced from both frontend & backend perspective; added daily backup cronjob which backs up the most crucial parts of the monitoring solution (flask, scripts, configuration files); pushed the current version to GitLab
- 10.19: added failover solution to ticketing; fixed many minor things (such as: silenced curl verbose standard output for some services when the script is called manually, and instead added a progress bar that shows what the script is currently doing and prints how long it ran; fixed harvest ticket creation, as more detailed tracing broke the proper acquisition of the hostname, similarly in the serviceup5min function...); moved the ENV specific variables to the beginning of the script so that it can be more easily deployed/changed/migrated; added some comments and refined the code

---

### [ `Backend` ]
### <mark>monitoring_services.sh</mark>

In [None]:
# create backend script:
vi /home/${USER}/monitoring_services.sh  
# ^this is the script that is cronned; the other one '/home/${USER}/monitoring_services' (note that there is no ".sh" extension) serves as 'pre-prod' script to manually check and debug when new features are added or something is to be changed

In [None]:
#!/bin/bash 

# TIMER -start
res1=$(date +%s.%N)
# measure runtime of this script


# D E B U G  M O D E
# set -x


####################################################################################
#       """ CHANGE THESE VALUES ACCORDING TO YOUR ENVIRONMENT:"""

flask_path=/home/${USER}/UI/flask_wapi
ENVIRONMENT="EU PROD"  # change depending on environment (can be: EU PROD / US / APAC)
services="grafana harvest influx nodered telegraf"
grafana_url="https://grafana.apps.XXXX.xxxgroup.net/api/health" 
HARVEST_HOSTS="HarvestHostXXX1 HarvestHostXXX2 HarvestHostXXX3..."
influx_url="https://influxdb.apps.XXXX.xxxgroup.net/health"
nodered_endpoint_url="https://nodered.apps.XXXX.xxxgroup.net/admin/"
primary_endpoint=https://XXXX.xxxgroup.net:8443/bppmws/api/Event/create?routingId=pforwemcellXX
secondary_endpoint=https://XXXX.xxxgroup.net:8443/bppmws/api/Event/create?routingId=pforwemcellXX

#          NOTE: In the notebook I've omitted the password for ${USER} user due to compliance&security reasons
                          # If copying the script from here (instead GitLab); then
                          #    update the "USER_PASSWORD" with proper values!!!
                          #    otherwise harvest service will not be monitored
        
#           READ THIS! - YOU HAVE TO also MANUALLY UPDATE the following items:
                #event_id (under create_ticket part)

                           # EU PROD IDs start with XXX10
                           # EU UAT IDs start with XXX00
                           # US IDs start with XXX20
                           # Apac IDs start with XXX30

####################################################################################



DATE=`date +'%m/%d/%Y %H:%M:%S'`
err_msg="not running @$DATE"
ok_msg="OK @$DATE"
cd $flask_path



# Define function to check service status:
services_check()
{
        echo -e '\e[1A\e[K\nchecking service status for'
        echo -ne '          (0%)\r'


# ACTIVE_IQ
    # not defined yet

# GRAFANA:
    grafana_status()
    {
	echo -e '\e[1A\e[Kchecking service status for Grafana'
	echo -ne '#                   (5%)\r'

        grafana_url=$grafana_url # PROD (port is 443)
        grafana_check="$(curl -s "$grafana_url" | grep -oh '[[:alpha:]]*ok[[:alpha:]]*')"  # checks if status is "ok"
        grafana_latency=$(curl -s -w 'Establish Connection: %{time_connect}s\nTTFB: %{time_starttransfer}s\nTotal: %{time_total}s\n' 127.0.0.1:3000/ping/api/health | egrep "Total: [1-9]"|cut -d' ' -f2) ;  # set only if latency is 1 second or more
        # log status: a failed health check is logged as an error; high latency is logged separately
        if [ "$grafana_check" != 'ok' ]; then
            echo "${service} $err_msg" >> ${service}_uptime.log
        else
            [ -n "$grafana_latency" ] && echo -e "$DATE $service latency is $grafana_latency high" >> ${service}_high_latency.log
            echo "${service} $ok_msg" >> ${service}_uptime.log
        fi
        export grafana_latency
    }

# HARVEST:
    harvest_status()
    {
	echo -e '\e[1A\e[Kchecking service status for Harvest'
	echo -ne '##                  (10%)\r'	

        for H in $HARVEST_HOSTS;
            do
                harvest_status="$(echo "QQ-USER_PASSWORD"  | /home/${USER}/scripts/.hrp ssh ${USER}@${H} 'systemctl status harvest')"
                echo "$harvest_status" > harvest_status_${H}.txt
                harvest_check=$(echo "$harvest_status" | sort -u | grep running | wc -l)
                if ! [ "$(echo $harvest_check)" == 1 ]; then echo "${service}" $err_msg >> ${service}_${H}_uptime.log && echo "${service}| $err_msg |$H" >> harvest_uptime.log; else echo "${service} $ok_msg" >> ${service}_${H}_uptime.log; fi
            done
        harvest_status="$(cat harvest_status*.txt)"; export harvest_status
        pollers="$(echo "$harvest_status" | grep "not running")"  # quoted so that grep sees one line per entry
    }

# INFLUX:
    influx_status()
    {
	echo -e '\e[1A\e[Kchecking service status for Influx'	
	echo -ne '###                 (15%)\r'

        influx_url=$influx_url  # PROD (port is 443, but that doesn't have to be explicitly defined as it is the default HTTPS port)
        influx_check="$(curl -s "$influx_url" | grep status |  grep -oh '[[:alpha:]]*pass[[:alpha:]]*')"  # check if status is "pass"
        influx_latency=$(curl -s -w 'Establish Connection: %{time_connect}s\nTTFB: %{time_starttransfer}s\nTotal: %{time_total}s\n' 127.0.0.1:8086 | egrep "Total: [1-9]"|cut -d' ' -f2)  # set only if latency is 1 second or more
        if [ "$influx_check" != 'pass' ]; then
            echo "${service} $err_msg" >> ${service}_uptime.log
        else
            if [ -n "$influx_latency" ]; then
                echo -e "$service latency is $influx_latency high @$DATE" | tee -a ${service}_high_latency.log ${service}_uptime.log
                influx_latency_err="$influx_latency"
            fi
            echo "${service} $ok_msg" >> ${service}_uptime.log
        fi

        influx_bucket_write_check="$(cat /etc/telegraf/logs/*.log | grep -i "error writing" | egrep "`date +'%Y-%m'`" | egrep `date +'%H:%M:'` | wc -l)"
        if [ "$(echo $influx_bucket_write_check)" -gt 1 ]; then influx_bucket_write_status="Write ERROR"; echo -e "$service cannot write into buckets @$DATE" | tee -a  ${service}_write_err.log ${service}_uptime.log; fi
        export influx_latency_err
        export influx_bucket_write_status

    }


# NodeRed
    nodered_status()
    {
        echo -e '\e[1A\e[Kchecking service status for NodeRed'
        echo -ne '####                (20%)\r'

	nodered_url=$nodered_endpoint_url
        nodered_check="$(curl -s -o - -I "$nodered_url" -X GET | grep -oh [[:alpha:]]*OK[[:alpha:]]*)"
        [ $(eval echo \$"${service}_check") == 'OK' ] && echo "${service} $ok_msg" >> ${service}_uptime.log || echo "${service}" $err_msg >> ${service}_uptime.log
        export nodered_check
    }


# TELEGRAF:
    telegraf_status()
    {
	echo -e '\e[1A\e[Kchecking service status for Telegraf'
	echo -ne '####                (20%)\r'

	systemctl | grep telegraf | sort -u | grep running > telegraf_status.txt  # needed for flask UI

        telegraf_check="$(systemctl | grep telegraf | sort -u | grep running | grep not | awk '{print $1}')"
        telegraf_check_count=$(grep -c running telegraf_status.txt)  # number of running telegraf units
        if [ "$telegraf_check_count" -eq 0 ]; then echo "${service}" $err_msg >> ${service}_uptime.log; else echo "${service} $ok_msg" >> ${service}_uptime.log; fi
        export telegraf_check
    }

# LOOP OVER SERVICES to check status:
for service in $services 
    do
        ${service}_status
    done
}



# send ticket (directly to Remedy) if service is down for 5 consecutive minutes (checked by every minute via cronjob)

ticket()
{

EPOCHNOW=`date -d "${DATE}" +"%s"`

    serviceup5min()
        {
        echo -e '\e[1A\e[Kchecking if any service was down for the past 5 consecutive minutes'
        echo -ne '######              (30%)\r'

        c=0  # counts how many times the app was down during the past 5 minutes (0 = no outage; 1..4 partial; 5 = app is down for 5 minutes)
        for((i=1;i<=5;++i))
            do
            stat=$(tac ${service}_*uptime.log | sed -n "${i}p")
            dat=$(echo $stat| cut -d@ -f2| cut -d'|' -f1)

            if [[ "$stat" == *"OK"*  ]]; then return 1

            else
                epoch_dat=`date -d "${dat}" +"%s"`
                if [ "$(echo $EPOCHNOW-$epoch_dat|bc)" -le "360"  ] # less or equal to 360 seconds AKA 6 min (5min +1min grace time due to latency)
                    then c=$((c+1))
                    export c
                    if [ "$c" == 1 ]; then echo "donwtime `date  +"%m/%d %H:%M:%S"` /" >>  ${service}_downtime.rep; fi
                fi
            fi

            done
        }


        create_event()  # only create & send event if it is down for 5 consecutive minutes - this part is skipped otherwise
                {
	
		echo -e '\e[1A\e[Kcreating ticket'
        	echo -ne '#######             (35%)\r'

                        if [[ $c -ne 5 ]]; then return 1  # debug mode is: "-eq"; normal mode is: "-ne" (not equal)
                        else
                                # ADAPTER_HOST  ## Name of the host where the event was created == Telegraf (always)
                                adapter_host=`hostname`

                                # MSG is dynamically generated message; consists of the parts below:
                                if [[ "$service" == "telegraf" ]]; then resolver_team="Monitoring";
				elif [[ "$service" == "harvest" ]]; then resolver_team="Storage"; 
				else resolver_team="Linux"; fi

                                # HOSTNAME / DD (unique differentiator - description) / EventID -- UPDATE ID ACCORDING TO YOUR ENVIRONMENT!
                                        # EU PROD IDs start with XXX10
                                        # EU UAT IDs start with XXX00
                                        # US IDs start with XXX20
                                        # Apac IDs start with XXX30
                                # DD for telegraf is sub processes; for harvest pollers; for grafana latency; for influx latency & write issue
                                if [[ "$service" == "telegraf" ]]; then hname=`hostname` && dd=$telegraf_check && event_id="XXX10"
                                elif [[ "$service" == "grafana" ]]; then hname="localhost" && dd="$grafana_latency" && event_id="XXX11"
                                elif [[ "$service" == "influx" ]]; then hname="localhost" && dd="$influx_latency_err $influx_bucket_write_status" && event_id="XXX12"
                                elif [[ "$service" == "harvest" ]]; then H=$(find $flask_path -name 'harvest*.log' -exec grep 'not running' {} \; -print | grep log | sort -u | cut -d_ -f4 | grep -v uptime); hname=$H && dd=$pollers && event_id="XXX13"
                                elif [[ "$service" == "nodered" ]]; then hname=`hostname` && dd=$nodered_check && event_id="XXX14"
                                fi

                                # msg Example: "[$resolver_team] CRITICAL: $service is down $date, $details"
                                msg="ß$resolver_team¤ CRITICAL: $service is down,DATE"

                                # CONTRACT_ID
                                ENV=$ENVIRONMENT

                                if [[ "$ENV" == "EU PROD" ]]; then contract_id="XXXXXXXXXXXXX"
                                elif [[ "$ENV" == "EU UAT" ]]; then contract_id="XXXXXXXXXXXXX"
                                elif [[ "$ENV" == "US" ]]; then contract_id="XXXXXXXXXXXXX"
                                elif [[ "$ENV" == "APAC" ]]; then contract_id="XXXXXXXXXXXXX"
                                fi


                   # SED process - create event.json from json.SED file
                                cd /home/${USER}/scripts/ticket
                                cat json.SED > event.json  # json.SED is a template with uppercase pseudo 'variables' which are replaced with the values of the lowercase real variables
                                for REPLACE in ADAPTER_HOST MSG CONTRACT_ID HNAME EVENT_ID SERVICE DD; do replace=${REPLACE,,}; sed -i "s/${REPLACE}/${!replace}/g" event.json; done
                                sed -i "s/ß/\[/g" event.json; sed -i "s/¤/\]/g" event.json  # this is to replace the brackets around the $resolver_team as otherwise it would mess up the sed command...
                                sed -i "s/DATE/$(date +'%Y\/%m\/%d %H:%M:%S')/g" event.json  # this replaces the date in DD2 field; again this has to be explicitly this way, because otherwise sed fails to do what it should.

                                cd -
                        fi
                }


    send_event()
        {
	echo -e '\e[1A\e[Kprocessing ticket'
        echo -ne '#######             (35%)\r'

	if [[ $c -eq 5 ]]; then

        cd /home/${USER}/scripts/ticket

        primary_endpoint=$primary_endpoint
        if ! [ -z $secondary_endpoint ]; then secondary_endpoint=$secondary_endpoint; else secondary_endpoint=$primary_endpoint; fi
        curl_cmd='curl -o - -s  -k -d "@event.json" -H "Content-Type: application/json" -H "authorization: basic XXXX" -X POST '  # replace XXXX with the base64-encoded "user:password"

        eval $(echo "$curl_cmd $primary_endpoint") | grep 'statusCode":"200"' > ep1.out
        if ! [[ "$(cat ep1.out)"  == *'statusCode":"200"'* ]]; then
          echo -e "\e[1A\e[Ksending ticket to primary endpoint failed - sending to secondary";
          echo "Endpoint: $primary_endpoint, status: $(cat ep1.out)" >> ticket_send_error.log;
          eval $(echo "$curl_cmd $secondary_endpoint") | grep 'statusCode":"200"' > ep2.out
          if [[ "$(cat ep2.out)"  == *'statusCode":"200"'* ]]; then
            echo -e "\e[1A\e[KTicket successfully sent"; else
            echo -e "\e[1A\e[Ksending ticket FAILED";
            # if ticket was not sent to any of the endpoints then at least write it to a logfile:
            echo "Endpoint: $secondary_endpoint, status: $(cat ep2.out)" >> ticket_send_error.log;
          fi
        fi

        cd -

	fi

        }


    for service in $services
        do
		serviceup5min && create_event && send_event  # as mentioned above, create_event and send_event only run if a service is down for 5 consecutive minutes
        done
}



manage_logs()
    {

	echo -e '\e[1A\e[Kgenerating log files'
        echo -ne '########            (40%)\r'

        # GENERATE UPTIME LOGS FOR FLASK
        service_uptime()  # Grafana & Influx
        {
	echo -e '\e[1A\e[Kgenerating log files - for telegraf'
        echo -ne '########            (40%)\r'	

                if [[ "$service" = "telegraf" ]]; then
                telegraf_sub_service_upt()
                    {
                    stat=`systemctl status telegraf_${sub}.service`
                    [[ $(echo "$stat" | grep "running") == *running* ]] && r="Running" || r="Down"
                    REPORT=$(echo $stat |tr ' ' '\n' |grep -A4 since | tail -3 | tr '\n' ' '|cut -d';' -f1)

                    echo -e "$r|since $REPORT" > telegraf_${sub}_uptime.txt

                    }

                for sub in esx system traps  #broadcom cisco esx storage system traps (these were defined only on UAT)
                do
                    telegraf_sub_service_upt
                done


                elif [[ "$service" = "harvest" ]]; then
                harvest_services_upt()
                    {
		    echo -e '\e[1A\e[Kgenerating log files for harvest'
		    echo -ne '##########          (50%)\r'

                    stat="$(cat harvest_status_${H}.txt)"  # per-host status saved by harvest_status()
                    [[ $(echo "$stat" | grep "running") == *running* ]] && r="Running" || r="Down"
                    REPORT=$(echo $stat |tr ' ' '\n' |grep -A4 since | tail -3 | tr '\n' ' '|cut -d';' -f1)

                    echo -e "$r|since $REPORT" > harvest_${H}_uptime.txt

                    }

                for H in $HARVEST_HOSTS 
                do
                    harvest_services_upt
                done


                else
		echo -e '\e[1A\e[Kgenerating log files for grafana & influx'
        	echo -ne '###############     (75%)\r'
                
		last=$(tac ${service}_uptime.log | grep -A1 -m 1 "not")  # sample: grafana OK @08/11/2022 17:28:03
                up=$(echo "$last" | tail -1)
                epoch_up=`date -d "$(echo $up | cut -d@ -f2)" +"%s"`
                down=$(echo "$last" | head -1)
                epoch_down=`date -d "$(echo $down | cut -d@ -f2)" +"%s"`
                prev_down=$(tac ${service}_uptime.log | grep -m 2 "not" | tail -1)
                epoch_prev_down=`date -d "$(echo $prev_down | cut -d@ -f2)" +"%s"`
                seconds=`echo "$epoch_down"-"$epoch_prev_down"|bc`

                if [[ "$seconds" -eq 0 ]]
                        then echo "UNKNOWN" > ${service}_uptime.txt
                else
                        downtime_minutes=$(echo $seconds/60|bc)
                        service_uptime=`echo $(echo "$EPOCHNOW"-"$epoch_up"|bc)/60|bc`

                        REPORT=$(date --date="$service_uptime" +"%m/%d %H:%M:%S")

                        echo -e "Down: $down|Up:$up\nOutage Time: $downtime_minutes minutes|$REPORT" > ${service}_uptime.txt
                fi
            fi
        }

       for service in $services 
        do
          service_uptime
        done


        past_incidents()
        {

	# PAST INCIDENTS part has been outsourced to ANOTHER SCRIPT (past_incidents.sh) that is scheduled hourly
	# this part only ensures that the UI always has these files, else the dynamically generated webpage breaks

	for f in montly weekly today;
	    do 
 	        touch ${f}.csv
	    done   

	}

	past_incidents

        echo -ne '####################(100%)\r'
        echo -ne '\n'

	echo -e '\e[1A\e[K '
	echo -e '\e[2A\e[K...completed!'

}


# Run main parts of the script:

services_check
ticket
manage_logs



# TIMER STOP (calculate runtime):
res2=$(date +%s.%N)
dt=$(echo "$res2 - $res1" | bc)
dd=$(echo "$dt/86400" | bc)
dt2=$(echo "$dt-86400*$dd" | bc)
dh=$(echo "$dt2/3600" | bc)
dt3=$(echo "$dt2-3600*$dh" | bc)
dm=$(echo "$dt3/60" | bc)
ds=$(echo "$dt3-60*$dm" | bc)
echo
printf "script run for: %d:%02d:%02d:%02.4f\n" $dd $dh $dm $ds
echo
set +x

#exit 0


In order to be able to run `systemctl status harvest` <b>remotely</b> you need to have the script below:
> `vi /home/${USER}/scripts/.hrp`

In [None]:
#!/bin/bash 
[[ $1 =~ password: ]] && cat || SSH_ASKPASS="$0" DISPLAY=nothing:0 exec setsid "$@"

...or simply [create passwordless ssh](https://linuxize.com/post/how-to-setup-passwordless-ssh-login/) between the hosts (for me this option was not permitted); 
- in that case you'll need to update the `monitoring_services.sh` script at `harvest_status()` function accordingly 
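
With key-based login in place, the Harvest check could be reduced to something like the sketch below. It is an assumption-laden illustration rather than the script's actual code: `is_running` and `check_harvest_host` are hypothetical names, and `BatchMode=yes` is the OpenSSH option that makes `ssh` fail instead of prompting for a password.

```shell
#!/bin/bash
# is_running (hypothetical helper): succeed if `systemctl status` output,
# read from stdin, reports the unit as active and running.
is_running() {
    grep -q 'Active: active (running)'
}

# check_harvest_host (hypothetical helper): query one host over key-based SSH.
check_harvest_host() {
    local host="$1"
    # BatchMode=yes: never prompt; fail if key-based login is not possible
    ssh -o BatchMode=yes -o ConnectTimeout=5 "${USER}@${host}" \
        'systemctl status harvest' 2>/dev/null | is_running
}

# Example (not executed here):
# for H in $HARVEST_HOSTS; do check_harvest_host "$H" || echo "harvest down on $H"; done
```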

For Remedy you need a similar template so that the `monitoring_services.sh` can generate an event/ticket
> `vi ~/scripts/ticket/json.SED`

```
[
    {
        "attributes": {
            "CLASS": "STORM_Netapp",
            "source": "NetApp",
            "severity": "MINOR",
            "origin": "monitoring",
            "sub_origin": "TrueSight_REST",
            "adapter_host": "ADAPTER_HOST",
            "msg": "MSG",
            "contract_id": "CONTRACT_ID",
            "ars_esc": "Yes",
            "ars_delay_time": "0",
            "hostname": "HNAME",
            "mc_host": "HNAME",
            "sub_source": "monitoring script",
            "bmw_esvr_type": "BEM",
            "server_loc": "ITA-Lab",
            "event_id":"EVENT_ID",
            "dd1":"critical:HNAME: SERVICE",
            "dd2":"DATE"
        }
    }
]
```
Note that `ADAPTER_HOST`; `MSG`; `CONTRACT_ID`; `HNAME`; `EVENT_ID`; `SERVICE`; `DATE` are automatically replaced with the correct values by the monitoring script; the other fields are static
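
The placeholder substitution can be tried on its own. The sketch below is a simplified stand-in for the `sed` loop in `create_event`, not the author's code: `sed_escape` and `render_template` are hypothetical helpers. Values are escaped first, so a `/` (as in a date) or an `&` in a value does not break the `s///` command.

```shell
#!/bin/bash
# sed_escape (hypothetical helper): escape the characters that are special in
# the replacement part of a sed s/// command (backslash, slash, ampersand).
sed_escape() {
    printf '%s' "$1" | sed -e 's|[\\/&]|\\&|g'
}

# render_template (hypothetical helper): print the template with every
# NAME=value pair applied as a global text substitution.
render_template() {   # usage: render_template <template> NAME=value ...
    local template="$1" pair name value out
    shift
    out=$(cat "$template") || return 1
    for pair in "$@"; do
        name=${pair%%=*}
        value=$(sed_escape "${pair#*=}")
        out=$(printf '%s\n' "$out" | sed "s/${name}/${value}/g")
    done
    printf '%s\n' "$out"
}

# Example (not executed here; file names as used by the monitoring script):
# render_template json.SED "HNAME=$(hostname)" "SERVICE=influx" > event.json
```

As in the original loop, the order of the pairs matters when one value contains another placeholder name.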

### `Maintenance info backend for UI & insert_uptime for reporting:`

### [ required ] ODBC config
> vi <mark>/etc/odbc.ini</mark>

In [None]:
#create a /etc/odbc.ini file with the content below: (replace servername, username and password with proper values!)
[DWH]
server = servername
#driver = unixodbc
driver = /opt/microsoft/msodbcsql18/lib64/libmsodbcsql-18.1.so.1.1
database = Common_View
username = "XXXX"
password = 'XXXX'
TrustServerCertificate = yes
Trace = Yes
TraceFile = /home/${USER}/UI/flask_wapi/odbc.log

#### [ *optional* ] Details regarding odbc and sql scripts 
`-- Double click for details --`

<!---

###### - Bash command to connect to the MS SQL database:
> (server, username and password is omitted by "XXX" here - to be replaced with real values!)


$`cat query_maint`
> #!/bin/bash

> isql -k "DRIVER={ODBC Driver 18 for SQL Server};SERVER=XXX,1433;UID=XXX;PWD=XXX;Authentication=SqlPassword;TrustServerCertificate=Yes" -v -b -d, -q < /home/${USER}/scripts/maint.sql

[isql man](https://www.mankier.com/1/isql#Options)

`-b`: Run isql in non-interactive batch mode. In this mode, the isql processes its standard input, expecting one SQL command per line.

-dDELIMITER: Delimits columns with delimiter.

`-c`: Output the names of the columns on the first row. Has any effect only with the -d or -x options.

`-q`: Wrap the character fields in double quotes.

$`cat maint.sql`
> SELECT * FROM Common_View.monitoring.V_Infrastructure_Maintenance

$`query_maint`  # output:
> 1,"Component_Grafana","Grafana",2022-08-17 00:00:00.0000000,2022-08-18 00:00:00.0000000,"12324","test chl"



###### - StorageClass with python and pandas:
```
import os
import subprocess
from io import StringIO

import pandas as pd

status, storage_df = subprocess.getstatusoutput('query_storage')
TESTDATA = StringIO(storage_df)

df = pd.read_csv(TESTDATA, sep=",")
df.shape
# Python does not expand shell variables, so resolve the home directory explicitly
df.to_csv(os.path.expanduser('~/storage.csv'), index=False)
```
##### `query_storage`:
Similarly to `query_maint` above, it calls an isql bash command with the corresponding SQL (storage.sql):

> SELECT * FROM Common_View.monitoring.V_StorageClass


###### - How to export data (to export in html format add "-w" after isql (in 'query' script file)
isql.sh < commands.sql  # >/dev/null 2>&1



###### - Proxy for package installation
> pip install <package>  --proxy "http://USER:localhost@XXXX"  #HTTP PROXY

> pip install <package>  --proxy "http://USER:localhost@XXXX"  #HTTPS PROXY

> pip install <package>  --proxy ".xxxgroup.net" #NOPROXY

#### Requirements (after installing miniconda)
```
sudo yum install unixODBC-devel
sudo yum -y install gcc gcc-c++ kernel-devel
sudo yum -y install python-devel libxslt-devel libffi-devel openssl-devel
pip install Flask --proxy "http://USER:localhost@XXXX"
pip install python-dotenv  --proxy "http://USER:localhost@XXXX"
pip install pandas --proxy "http://USER:localhost@XXXX"
pip install pyodbc --proxy "http://USER:localhost@XXXX"
```

-->

### - [ Backend scripts ] *(continued)*

### <mark>insert_uptime.sh</mark>

In [None]:
#!/bin/bash

#cd $flask_path
cd /home/${USER}/UI/flask_wapi

downtime_check=$(cat *_uptime.txt |  awk '{print $2}' | grep '[0-9]' |sort -n | head -1)
if [[ " $downtime_check" -lt  "1440" ]]  # if outage happened within the past 24 hours
 then	for service in grafana harvest influx telegraf;
		do start_time=$(cat ${service}_downtime.rep); end_time=$(cat ${service}_*uptime.txt|cut -d'|' -f3 | sort -n | tail -1); date_check=$(echo $end_time | grep '[0-9]')
			#if [[ -z "$date_check" ]]; then end_time=$(echo `date +"%Y-%m-%d"` 23:59:59); fi
			if [[ -z "$date_check" ]]; then end_time=$(echo `date +%Y-%m-%d -d "yesterday"` 23:59:59); fi  # as this script is scheduled (cron) to run after midnight

			if [[ $service = "grafana" ]]; then prfx="graf"
			elif [[ $service = "harvest" ]]; then prfx="hv"
			elif [[ $service = "influx" ]]; then prfx="infl"
			elif [[ $service = "telegraf" ]]; then prfx="tg"
			fi

			sql="INSERT INTO [DATABASE].[dbo].[TABLE]([DWH_Key],[DWH_CreatedBy],[DWH_CreatedDate],[ServerName],[InterruptionStart_UTC],[InterruptionEnd_UTC],[TicketNo]) VALUES ('`hostname`|$service|$start_time', 'insert_uptime.sh', {ts'`date +"%Y-%m-%d %H:%M:%S"`'}, '${prfx}_HOSTNAME', {ts'$start_time'}, {ts'$end_time'}, null)"

			echo "$sql" > uptime.sql
			query < uptime.sql
			echo Downtime report sent to database at `date +"%Y-%m-%d %H:%M:%S"` >> database_downtime_insertions.log
		done
 else
	echo No downtime in the past 24 hours.
fi

# reset the downtime markers only after they have been read and reported:
echo "" | tee ./*downtime.rep > /dev/null

# note that this requires to have a solution set up from MSSQL side as well!
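
The service-to-prefix mapping can also be written as a `case` statement, which keeps each service on one line. A small sketch: the function name `service_prefix` is hypothetical, while the prefixes are the ones used in the script above.

```shell
#!/bin/bash
# service_prefix (hypothetical helper): print the server-name prefix for a service.
service_prefix() {
    case "$1" in
        grafana)  echo "graf" ;;
        harvest)  echo "hv"   ;;
        influx)   echo "infl" ;;
        telegraf) echo "tg"   ;;
        *)        return 1    ;;   # unknown service: let the caller decide
    esac
}

# Example (not executed here):
# prfx=$(service_prefix "$service") || echo "no prefix defined for $service"
```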

### query_maint 
(note that you have to replace the `server_name`; `user_id`; `password` values below)

In [None]:
#!/bin/bash
isql -k "DRIVER={ODBC Driver 18 for SQL Server};SERVER=server_name,1433;UID=user_id;PWD=password;Authentication=SqlPassword;TrustServerCertificate=Yes" -v -b -d, -q < /home/${USER}/scripts/sql/maint.sql

### <mark>maintenance.py</mark>

In [None]:
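#!/home/${USER}/miniconda3/bin/python
# NOTE: variables are not expanded in a shebang line; replace ${USER} above with the actual user name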

import pandas as pd 
import json 
import requests 
import pyodbc 
import os
import subprocess
from io import StringIO

import time
start = time.time()

status, maint_df = subprocess.getstatusoutput('query_maint')
DATA = StringIO(maint_df)
df = pd.read_csv(DATA, sep=",")
# Python does not expand ${USER}, so resolve the home directory explicitly
df.to_csv(os.path.expanduser('~/UI/flask_wapi/maintenance.csv'), index=False)
#df.shape

### <mark>past_incidents.sh</mark>

In [None]:
#!/bin/bash

# TIMER -start
res1=$(date +%s.%N)
# measure runtime of this script

flask_path="/home/${USER}/UI/flask_wapi"
cd $flask_path


past_incidents()

{

    for T in MONTH WEEKS DAYS;
        do declare t=${T,,};
            RANGE=$(date -d "-1 ${t}" +"%s");
            : > incidents_${t}.csv  # start from an empty file so that entries older than the range drop out

            rm logfiles 2>/dev/null; for file in *.log; do echo $file >> logfiles; cat logfiles|sort -u > logfiles.list; done
            for service in grafana harvest influx telegraf; 
		do file=$(cat logfiles.list|grep $service); cat $file|egrep -i '(not|high)' | while read LINE;
		    do x=$(echo $LINE |cut -d'@' -f2 | cut -d' ' -f1); 
			if ! [[ $x == '' ]]; then y=$(date -d "$x" +"%s"); 
				if [ "$RANGE" -le "$y" ]; then echo $LINE >> incidents_${t}.csv; 
				fi; 
			fi; 
		    done; 
		done
        done

cat incidents_days.csv | sort -u > today.csv
cat incidents_weeks.csv | sort -u > weekly.csv
cat incidents_month.csv | sort -u > montly.csv

}


past_incidents


# TIMER STOP (calculate runtime):
res2=$(date +%s.%N)
dt=$(echo "$res2 - $res1" | bc)
dd=$(echo "$dt/86400" | bc)
dt2=$(echo "$dt-86400*$dd" | bc)
dh=$(echo "$dt2/3600" | bc)
dt3=$(echo "$dt2-3600*$dh" | bc)
dm=$(echo "$dt3/60" | bc)
ds=$(echo "$dt3-60*$dm" | bc)
echo
printf "script run for: %d:%02d:%02d:%02.4f\n" $dd $dh $dm $ds
echo
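
The range filter inside `past_incidents` boils down to: take the timestamp after the `@`, convert it to epoch seconds, and keep the line if it is not older than the cutoff. A minimal sketch, assuming GNU `date` and the log format written by `monitoring_services.sh`; `filter_since` is a hypothetical helper name.

```shell
#!/bin/bash
# filter_since (hypothetical helper): print the log lines from stdin whose
# "@MM/DD/YYYY HH:MM:SS" timestamp is at or after the given epoch cutoff.
filter_since() {
    local cutoff="$1" line stamp epoch
    while IFS= read -r line; do
        stamp=${line#*@}                      # text after the first "@"
        [ "$stamp" != "$line" ] || continue   # no "@" in this line
        stamp=${stamp%%|*}                    # drop any "|host" suffix
        epoch=$(date -d "$stamp" +%s 2>/dev/null) || continue
        [ "$epoch" -ge "$cutoff" ] && printf '%s\n' "$line"
    done
    return 0
}

# Example (not executed here):
# filter_since "$(date -d '-1 week' +%s)" < grafana_uptime.log
```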

backup
> `vi ~/scripts/backup_flask`

In [None]:
#!/bin/bash

dir="/home/${USER}/backup"
mkdir -p $dir
D=$(date +%Y%m%d%H%M%S); cd $dir

config_backup()
    {
            cp /home/${USER}/.bashrc ${dir}/.bashrc
            cp /etc/odbc.ini ${dir}/odbc.ini
            cp /etc/logrotate.d/flask ${dir}/flask
    }

if ! [ -z "$1" ]
 then
    config_backup
    tar cvzf scripts.tar.gz /home/${USER}/scripts && tar cvzf flask_wapi.tar.gz /home/${USER}/UI/flask_wapi && tar cvzf telegraf.tar.gz /etc/telegraf
 else
    tar cvzf scripts_${D}.tar.gz /home/${USER}/scripts && tar cvzf flask_wapi_${D}.tar.gz /home/${USER}/UI/flask_wapi && tar cvzf telegraf.tar.gz /etc/telegraf
fi


## CRONTAB:

In [None]:
# Edit the crontab via:
crontab -e
# paste the below:
*/1 * * * * source /home/${USER}/.bashrc; /home/${USER}/monitoring_services.sh 2>/dev/null
*/1 * * * * source /home/${USER}/.bashrc; /home/${USER}/scripts/maintenance.py > /dev/null 2>&1  # runs every minute
0 * * * * source /home/${USER}/.bashrc; /home/${USER}/scripts/UI/past_incidents.sh 2>/dev/null  # runs every hour
0 20 * * * source /home/${USER}/.bashrc; backup_flask full > /dev/null 2>&1  # backs up monitoring stuff daily
5 0 * * * source /home/${USER}/.bashrc; insert_uptime.sh  # since bashrc is sourced no fullpath is needed
@reboot /home/${USER}/UI/flask_wapi/run  # starts flask UI
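
A crontab entry needs five time fields before the command, and a missing field is easy to overlook. The function below is a rough sanity check, not a full crontab parser: the name `check_cron_line` is hypothetical, and it deliberately treats month or weekday names (`jan`, `mon`) and environment lines (`MAILTO=...`) as suspicious.

```shell
#!/bin/bash
# check_cron_line LINE: return 0 if LINE looks like a valid crontab entry.
# Comments, empty lines and @reboot-style entries are accepted as they are.
check_cron_line() {
    local line=$1 field
    local re='^[0-9*/,-]+$'                 # digits, *, /, comma and dash only
    local -a fields
    case $line in
        ''|'#'*|@*) return 0 ;;
    esac
    read -r -a fields <<< "$line"           # split on whitespace, no globbing
    [ "${#fields[@]}" -ge 6 ] || return 1   # 5 time fields + a command
    for field in "${fields[@]:0:5}"; do
        [[ $field =~ $re ]] || return 1
    done
    return 0
}
```

Possible use after editing: `crontab -l | while IFS= read -r l; do check_cron_line "$l" || echo "suspicious: $l"; done`.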

logrotate
> `vi /etc/logrotate.d/flask`

# NOTE: logrotate does not expand shell variables; replace ${USER} with the real account name
/home/${USER}/UI/flask_wapi/*uptime.log
{
    rotate 3
    create 0644 ${USER} ${USER}
    monthly
    size 10M
    missingok
    dateext
    copytruncate
    notifempty
    compress
    delaycompress
}
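
Since `${USER}` is only a placeholder in the file above, one option is to generate the file from the shell so the account name is filled in. A minimal sketch, assuming the same options as above; the function name `render_logrotate_conf` is hypothetical.

```shell
#!/bin/bash
# render_logrotate_conf USERNAME: print the logrotate config with the
# account name filled in (the heredoc expands ${user}, logrotate itself would not).
render_logrotate_conf() {
    local user=$1
    cat <<EOF
/home/${user}/UI/flask_wapi/*uptime.log
{
    rotate 3
    create 0644 ${user} ${user}
    monthly
    size 10M
    missingok
    dateext
    copytruncate
    notifempty
    compress
    delaycompress
}
EOF
}
```

Possible use: `render_logrotate_conf "$USER" | sudo tee /etc/logrotate.d/flask`, followed by a dry run with `sudo logrotate -d /etc/logrotate.d/flask` (debug mode, no files are rotated).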

# UI [ Monitoring Interface ]

`http://<telegraf_host_IP>:8000`

### UI Details


##### GitLab URL (XXXX internal): 
`https://XXXX.xxxgroup.net/monitoringsolutions/monitoring_services/-/blob/main/`


<b>*routes.py*</b> is the main UI script; it renders the HTML templates located under `~/UI/flask_wapi/application/templates`

<b>*layout.html*</b> is the main HTML file; it contains the `css` formatting, the `navbar` and the `footer` that are present on all pages
- <b>*infra.html*</b> is the homepage which shows services status (green if all related jobs are running, yellow if partial, red if dead)
> content is dynamically generated by the backend script `monitoring_services.sh` and by `~/UI/flask_wapi/subprocesses/*.py`
- <b>*past incidents*</b> pill/tab: shows incidents that happened within the past 24 hours / week / month
- <b>*maintenance*</b> pill/tab: shows assets that were/are under maintenance


![Home Page](https://raw.githubusercontent.com/mmarkus13/flask_monitoring/main/UI/infra.png "Home Page")
![Past Incidents](https://raw.githubusercontent.com/mmarkus13/flask_monitoring/main/UI/pastincidents.png "past incidents")
![Maintenance](https://raw.githubusercontent.com/mmarkus13/flask_monitoring/main/UI/maintenance.png "maintenance")

# Decommission / 'Uninstall' / How to remove the monitoring solution:

> rm -rf ~/miniconda3  `# also remove related lines from the ~/.bashrc file`

> rm -rf ~/scripts ~/UI/flask*/

---

# Tool configurations:

# - <mark>Telegraf</mark>

### Info from NetApp:
---
```
I prepared the Grid configuration already, so that the Telegraf can start to collect the data. 
Please take care that it can take up to 15 minutes with the initial data collection till it get reflected into the InfluxDB with the “prometheus” _measurements. 

StorageGRID requires a certificate authentication, so in addition I attached you the required certificates. 
Move them in the /etc/telegraf directory or subdirectory (modify tls_ca/tls_ca_cert & tls_key path in this case).

There are 3 configuration parts to be modified / checked. 

#1  modify the common Telegraf config (at the beginning of the config file)
[agent]
   interval = "60s"
   metric_batch_size = 5000
   metric_buffer_limit = 75000


#2  add the Storagegrid Input config
 [[inputs.prometheus]] 
   urls = ['https://XX.X.XX.XX:XXXX/federate?matchXXXX']
   metric_version = 2
   tls_ca = "/etc/telegraf/cacert.pem"
   tls_cert = "/etc/telegraf/cert.pem"
   tls_key = "/etc/telegraf/key.pem"
   insecure_skip_verify = true
   response_timeout = "59s"


#3 check your [[outputs.influxdb_v2]] configuration. 
Telegraf will write the data into the according bucket you set here. 


After this restart the Telegraf (via cmd # sudo systemctl stop telegraf & # sudo systemctl start telegraf). 
15 Minutes after this, the InfluxDB will reflect the StorageGRID data.
``` 

> Source: mail @Fri 22/05/13 13:11
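
Before restarting Telegraf it may help to confirm that the three certificate files named in the mail are in place. A minimal sketch; the function name `check_tls_files` is hypothetical, and the default folder is the `/etc/telegraf` path from the mail.

```shell
#!/bin/bash
# check_tls_files [DIR]: verify that cacert.pem, cert.pem and key.pem are
# readable in DIR (default: /etc/telegraf). Returns non-zero if one is missing.
check_tls_files() {
    local dir=${1:-/etc/telegraf} missing=0 f
    for f in cacert.pem cert.pem key.pem; do
        if [ -r "$dir/$f" ]; then
            echo "found    $dir/$f"
        else
            echo "missing  $dir/$f"
            missing=1
        fi
    done
    return $missing
}
```

If all files are found, a one-off run such as `telegraf --config <your config file> --test` should print the collected metrics without writing to InfluxDB; which config file to pass depends on the host.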


## UAT Telegraf config steps (cli)
Telegraf configuration
Telegraf agents are located on: `XXXX` (hostname omitted)

> <b>Sources</b>:
- SNMP trap receiver
- SNMP query for Cisco switches
- VM server data receiver

Installation folder: `/etc/telegraf`

> <b>Services</b>:
*each telegraf host has its own sub services*
>#### systemctl status telegraf_broadcom.service
>`/usr/bin/telegraf -config /etc/telegraf/telegraf_broadcom.conf -config-directory /etc/telegraf/telegraf_broadcom`
>#### systemctl status telegraf_cisco.service
>`/usr/bin/telegraf -config /etc/telegraf/telegraf_cisco.conf -config-directory /etc/telegraf/telegraf_cisco`
>#### systemctl status telegraf_esx.service
>`/usr/bin/telegraf -config /etc/telegraf/telegraf_esx.conf -config-directory /etc/telegraf/telegraf_esx`
>#### systemctl status telegraf_storage.service
>`/usr/bin/telegraf -config /etc/telegraf/telegraf_storage.conf -config-directory /etc/telegraf/telegraf_storage`
>#### systemctl status telegraf_traps.service
>`/usr/bin/telegraf -config /etc/telegraf/telegraf_traps.conf -config-directory /etc/telegraf/telegraf_traps`
>#### systemctl status telegraf_system.service
>`/usr/bin/telegraf -config /etc/telegraf/telegraf_system.conf -config-directory /etc/telegraf/telegraf_system`
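
The sub services listed above can be checked in one go instead of one `systemctl status` call each. A minimal sketch, assuming all six units exist on the host (as noted, each Telegraf host has its own subset, so adjust the list); the function name `check_telegraf_units` is hypothetical.

```shell
#!/bin/bash
# Suffixes of the telegraf_<name>.service units listed above.
telegraf_units="broadcom cisco esx storage traps system"

# check_telegraf_units: print the state of every unit and return
# non-zero if at least one of them is not active.
check_telegraf_units() {
    local name unit rc=0
    for name in $telegraf_units; do
        unit="telegraf_${name}.service"
        if systemctl is-active --quiet "$unit"; then
            echo "$unit: active"
        else
            echo "$unit: NOT active"
            rc=1
        fi
    done
    return $rc
}
```

Calling `check_telegraf_units` without arguments prints one line per unit.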

To receive SNMP traps from AIQ UM, two MIB files must be copied to the configured MIB path. The file names matter and must be exactly:
- NETAPP.MIB
- OCUM.MIB (this is a renamed aiqum_9.9.mib)
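
The copy and rename of the two MIB files can be sketched as below. The function name `install_mibs` is hypothetical, and the destination folder is an assumption: use whatever path the trap input is configured with (`/usr/share/snmp/mibs` is only a common default, not confirmed for this setup).

```shell
#!/bin/bash
# install_mibs SRC_DIR DEST_DIR: copy NETAPP.MIB as is and copy
# aiqum_9.9.mib under the required name OCUM.MIB.
install_mibs() {
    local src=$1 dest=$2
    mkdir -p "$dest" || return 1
    cp "$src/NETAPP.MIB" "$dest/NETAPP.MIB" || return 1
    cp "$src/aiqum_9.9.mib" "$dest/OCUM.MIB" || return 1
    ls -l "$dest/NETAPP.MIB" "$dest/OCUM.MIB"   # show the result
}
```

Example call (paths are placeholders): `install_mibs /tmp/mibs /usr/share/snmp/mibs`, run with sufficient rights for the destination folder.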

# - <mark>Influx</mark>

## Install Influx CLI and Modify `bucket's retention`

Download package from the following URL: https://docs.influxdata.com/influxdb/cloud/tools/influx-cli/?t=Windows

Install the CLI on the VDI. Because we have no write permission on the `C:\Program Files` folder, the original commands need to be modified.

Original:

`Expand-Archive .\influxdb2-client-2.3.0-windows-amd64.zip -DestinationPath 'C:\Program Files\InfluxData'`

`mv 'C:\Program Files\InfluxData\influxdb2-client-2.3.0-windows-amd64' 'C:\Program Files\InfluxData\influx'`

Modified:

`Expand-Archive .\influxdb2-client-2.3.0-windows-amd64.zip -DestinationPath 'C:\InfluxData'`

`mv 'C:\InfluxData\influxdb2-client-2.3.0-windows-amd64' 'C:\InfluxData\influx'`

Use PowerShell for the following steps. Before running the modified commands, navigate to the folder where you downloaded the CLI package. For example:
```
# 1. Go to the download folder and unpack the CLI (replace USERNAME):
cd C:\Users\USERNAME\Downloads
mkdir C:\InfluxData
Expand-Archive .\influxdb2-client-2.3.0-windows-amd64.zip -DestinationPath 'C:\InfluxData'
mv 'C:\InfluxData\influxdb2-client-2.3.0-windows-amd64' 'C:\InfluxData\influx'

# 2. We cannot modify the 'path' variable, so go to the folder where influx.exe exists:
cd C:\InfluxData\influx

# 3. Create an influx CLI config for the remote host:
.\influx config create -a -n CONFIGNAME -u URL -t TOKEN_WHICH_HAS_PROPER_PRIVILEGES -o ORGANIZATION

# 4. List the buckets' current settings:
.\influx.exe bucket list
# ID     Name          Retention  Shard group duration  Organization ID  Schema Type
# XXXXX  BroadcomBES   1440h0m0s  24h0m0s               XXXXX            implicit
# XXXXX  CiscoBackend  1440h0m0s  24h0m0s               XXXXX            implicit

# 5. Modify a bucket's retention
#    (command reference: https://docs.influxdata.com/influxdb/v2.2/organizations/buckets/update-bucket/):
.\influx bucket update -i BUCKET_ID -r NEW_RETENTION_TIME
```
Done
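
The retention in the bucket listing is shown in hours (`1440h0m0s` is 60 days), so a small conversion helps when choosing `NEW_RETENTION_TIME`. The helper below is a sketch for a bash shell (for example on the Telegraf host, if the influx CLI is also installed there); the name `days_to_retention` is hypothetical.

```shell
#!/bin/bash
# days_to_retention DAYS: print a whole number of days as an hour-based
# duration string such as 1440h.
days_to_retention() {
    local days=$1
    case $days in
        ''|*[!0-9]*)
            echo "usage: days_to_retention <whole days>" >&2
            return 1 ;;
    esac
    echo "$(( 10#$days * 24 ))h"            # 10# avoids octal parsing of e.g. 08
}
```

Possible use: `influx bucket update -i BUCKET_ID -r "$(days_to_retention 90)"`.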
