Archive files are missing data #19
What version of the openHistorian are you using? How is the data flowing into the openHistorian? Is this PMU data? Thanks, |
@ritchiecarroll ,
|
So you have developed a custom action adapter that runs inside the openHistorian that receives data from the FDRs? Or are you using the built-in FNET device protocol? The only thing I can think of is that if the timestamps of the incoming data were duplicated - these points would be harder to extract since the historian "key" is based on ID, Timestamp then a counter. |
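The key ordering described above can be sketched as follows. This is a hypothetical Python model, not the actual openHistorian key classes: archive keys sort by point ID, then timestamp, then a collision counter, so two samples sharing an ID and timestamp are distinguished only by the counter.

```python
# Hypothetical sketch of the historian "key" described above (not the
# actual openHistorian key classes): order is ID, timestamp, counter.
from collections import namedtuple

Key = namedtuple("Key", "point_id timestamp entry_number")

def make_keys(measurements):
    """Assign entry numbers so duplicated (id, timestamp) pairs stay unique."""
    seen = {}
    keys = []
    for point_id, timestamp in measurements:
        n = seen.get((point_id, timestamp), 0)
        seen[(point_id, timestamp)] = n + 1
        keys.append(Key(point_id, timestamp, n))
    return sorted(keys)

# Point 785 reports the same timestamp twice; the two samples differ
# only in entry_number, which is why duplicated timestamps are harder
# to address individually when extracting data.
keys = make_keys([(785, 100), (785, 100), (786, 100)])
```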
@ritchiecarroll
|
@ritchiecarroll
|
I am assuming the actual F-NET data comes in through a standard device connection using the built-in F-NET protocol. If so, the steps below will help us see if the outage is due to a reconnect - it could be that there is a delay in the source data stream and the device simply reconnects, a standard operation when data stops flowing. By default this timeout is set to 5 seconds, which may be too short, causing the device to reconnect. This setting is called "Data Loss Interval" and can be found on the openHistorian Manager device configuration screen.
From the openHistorian machine connected to the remote F-NET streams, run the "Statistics Trending Tool" - that is the name in the start menu; the actual EXE is StatisticView.exe in the openHistorian installation folder. Once running, connect to the Statistics historian - the application should already default to this archive when no other local archive is installed, but just in case, the default statistics archive path is "C:\Program Files\openHistorian\Statistics\", assuming openHistorian was installed on the C: drive.
Now, for the connected F-NET device, find the statistic that ends with "!IS:ST8", i.e., input stream statistic number 8, a boolean value indicating whether the input stream was continually connected during the last reporting interval. The default statistics reporting interval is 10 seconds, so if the stat is non-zero the device was continually connected over that 10-second period. If you select this statistic and trend its value over the time window of the data gap you found, you will know whether the device connection was interrupted and hence caused the data gap.
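As a sketch of that last step: assuming each trended statistic sample is a (time, value) pair, the windows worth inspecting are the reporting intervals where !IS:ST8 was zero. This is a hypothetical helper for illustration, not part of any GPA tool.

```python
def disconnected_windows(samples, interval_seconds=10):
    """Given (unix_time, st8) samples, where st8 is nonzero only if the
    input stream stayed connected for the whole reporting interval,
    return the (start, end) windows containing an interruption."""
    return [(t - interval_seconds, t) for t, st8 in samples if st8 == 0]

# A zero reported at t=110 means the connection dropped sometime
# in the preceding 10-second interval:
print(disconnected_windows([(100, 1), (110, 0), (120, 1)]))  # → [(100, 110)]
```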
If the device connection was interrupted, you can go through the logs (using the "Log File Viewer" application) and check the messages for the device around the same timeframe. From this you can get a clue whether the connection was terminated by the remote device or reset because data stopped flowing, often due to a change in the network, e.g., a router reboot. If this is not the case then there may be something else going on, and I would suggest carefully monitoring the logs for errors or connection issues. Thanks, |
@ritchiecarroll |
We may want you to install a newer build so that we can enable enhanced logging that we can evaluate when a gap occurs. Would this be possible? If so, you can update the existing version with a nightly build: https://www.gridprotectionalliance.org/NightlyBuilds/openHistorian/Beta/openHistorian.Installs.zip The new version has a new detailed logging system that will provide much more detail. Also, the new archive log files are time-stamped so that we can find overlapping logs around the time of a detected data gap. After installation the log files will end up in the "C:\Program Files\openHistorian\Logs" folder. Thanks, |
Also - what about data quality? Are we sure received data is not NaN? Are you writing captured data from your custom adapter to a file that we could review? Thanks, |
@ritchiecarroll
Right now the adapter doesn't write data to files; only the timestamp and value count are captured. Since we have another self-developed server receiving the same forwarded data without any problem, the data quality should be good. To make 100% sure, I'll write all the data to files after installing the aforementioned build. |
@ritchiecarroll Didn't find any connection status change. The custom action adapter received all the data, and the data quality is good. It seems all data is received by openHistorian, but something stopped openHistorian from archiving some of the time frames. |
OK, thanks. Let's get a copy of the log files around that time frame and see what interesting things were going on. |
@ritchiecarroll This is the log file. Please change the extension back to .logz; I'm not allowed to upload logz or zip files... Are there any other log files which may have useful information? |
Yes, just as I suspected. Your system undergoes a very long pause, possibly 10 seconds in duration. This causes measurements to not be received within the lag time window specified in the concentrator adapter. We used to have this problem a ton at OGE, but have turned on some optimizations so this doesn't happen.
To verify this is the root cause, turn on thread pool monitoring and restart openHistorian. You can do this by modifying your openHistorian.exe.config file: find the line that says OptimizationsConnectionString and add EnableThreadPoolMonitoring=;
It should look like this:

```xml
<add name="OptimizationsConnectionString"
     value="EnableThreadPoolMonitoring=;"
     description="Specifies which optimizations to enable for the system."
     encrypted="false" />
```
…On Wed, Feb 8, 2017 at 8:48 PM, J. Ritchie Carroll wrote:
> I don't see the log file attached?
|
Looks like there is a pause, possibly Garbage Collection related, and we will want to validate the GC settings in your config file (i.e., openHistorian.exe.config). Regardless, the source device simply reconnects when no data is received during the pause, again, a standard operation when data stops flowing. By default this timeout is set to 5 seconds, which must be too short since the device is reconnecting. This setting is called "Data Loss Interval" and can be found on the openHistorian Manager device configuration screen. I suggest changing this to 15 seconds for each input device to see if the data loss stops. Thanks, |
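For reference, .NET GC behavior for a service is controlled in the application's config file. A hedged illustration of the relevant runtime section of openHistorian.exe.config (the exact values to validate would depend on the installed build; this is not a prescribed configuration):

```xml
<!-- Illustrative only: server GC with background (concurrent) collection
     generally shortens GC pause times for a long-running service. -->
<configuration>
  <runtime>
    <gcServer enabled="true" />
    <gcConcurrent enabled="true" />
  </runtime>
</configuration>
```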
Added EnableThreadPoolMonitoring=; and set 'Data Loss Interval' to 15 by modifying the 'Device' table in the configuration database directly, and openHistorian has been restarted. Hope this helps.
I still have some questions. According to the custom action adapter and the statistics database, all the data is received by openHistorian and all devices are continually connected. Also, if there were a pause, we should observe an obvious change in the time difference between the time stamp (UTC) and the log time, which is the local time when the custom adapter received the time frame. But according to the log files the time difference is 2.3 seconds and didn't change. |
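The reasoning above can be sketched as a quick check over the adapter's log. This is a hypothetical helper for illustration (frame timestamps and receive times in seconds), not code from the thread's adapter.

```python
def latency_jumps(frames, threshold=1.0):
    """frames: list of (frame_timestamp, receive_time) pairs in seconds.
    A long pause would show up as a receive latency that deviates from
    the first frame's latency by more than `threshold` seconds."""
    baseline = frames[0][1] - frames[0][0]
    return [i for i, (ts, rx) in enumerate(frames)
            if abs((rx - ts) - baseline) > threshold]

# A steady 2.3 s latency on every frame, as observed in the logs,
# means no pause shows up on the receive side:
frames = [(t, t + 2.3) for t in range(50)]
print(latency_jumps(frames))  # → [] (steady latency, no pause)
```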
I agree - that's what is strange about all this. Also, there are some very unusual messages in your error log. Perhaps the next best step is to have a WebEx and take a look? Want to send me an e-mail about scheduling this? Thanks, |
@ritchiecarroll |
Tomorrow afternoon? Please send me an e-mail at my GPA e-mail address and we will arrange this... |
@ritchiecarroll @StevenChisholm |
@ritchiecarroll @StevenChisholm |
@ritchiecarroll @StevenChisholm |
@StevenChisholm One of the captured examples is shown below, in which
|
Yes, I have confirmed from a code review that x should start from index and not 0. The consequence of not doing this is that streams with the later values are ignored if the duplicate value happens to be at the end of the stream. Since in general we are not working with many duplicate values, we never had this issue crop up before.
Let us know if this fix solves the problem and we will update the code.
…On Sat, Feb 11, 2017 at 8:58 PM, yuwenpeng wrote:
@StevenChisholm @ritchiecarroll
I think I found the bug. It is in UnionTreeStreamSortHelper.SortAssumingIncreased(int index), at line 116 as shown in the screenshot below.
(screenshot: https://cloud.githubusercontent.com/assets/25643171/22859170/f760a7e0-f0a0-11e6-8747-45f814ca2863.png)
x should not start from 0 but from index. When there are duplicated keys in the BufferedTreeStreams, and the duplicated key happens to be the last key of the second (or third, and so on) BufferedTreeStream, the sort function makes a mistake.
One of the captured examples is shown below, in which 2017-02-12 00:37:22.3000000/785 is the duplicated key. There are 3 sections:
1. The original order of the BufferedTreeStreams and their cached key and valid flag before removing the duplicated key.
2. Information of the streams after removing the duplicated key by advancing the position of the second duplicate entry.
3. Information and order of the streams after sorting.
The sort caused the 3 remaining streams to be ignored.
(screenshot: https://cloud.githubusercontent.com/assets/25643171/22859212/bb2f8cda-f0a2-11e6-96b6-124a3c95ba19.png)
Then the data gets lost.
(screenshot: https://cloud.githubusercontent.com/assets/25643171/22859292/bbc6565e-f0a4-11e6-8257-fa68ded27a98.png)
|
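To make the failure mode concrete, here is a simplified, hypothetical model in Python of the union-stream merge described in the bug report above. It is not the actual C# UnionTreeStreamSortHelper code; the names and structure are invented for illustration. The `buggy` flag reproduces starting the bubble pass at 0 instead of `index`: because the pass returns as soon as it sees an ordered pair, an exhausted stream stranded mid-list hides every stream behind it, and their keys are never read.

```python
INF = float("inf")  # sentinel key for an exhausted stream

class BufferedStream:
    """A sorted stream of keys with a cached 'current key'."""
    def __init__(self, keys):
        self._keys, self._pos = list(keys), 0
    @property
    def key(self):
        return self._keys[self._pos] if self._pos < len(self._keys) else INF
    def advance(self):
        self._pos += 1

def sort_assuming_increased(streams, index, buggy=False):
    """Restore ascending order after streams[index]'s key increased,
    bubbling it right; returns early once an ordered pair is found."""
    for x in range(0 if buggy else index, len(streams) - 1):
        if streams[x].key > streams[x + 1].key:
            streams[x], streams[x + 1] = streams[x + 1], streams[x]
        else:
            return  # already ordered from here on

def union_read(key_lists, buggy=False):
    """Merge sorted key streams, dropping duplicated keys."""
    streams = sorted((BufferedStream(k) for k in key_lists),
                     key=lambda s: s.key)
    out = []
    while streams[0].key != INF:
        current = streams[0].key
        # Remove duplicates of the current key from later streams.
        while len(streams) > 1 and streams[1].key == current:
            streams[1].advance()                        # may exhaust that stream...
            sort_assuming_increased(streams, 1, buggy)  # ...then re-sort at index 1
        out.append(current)
        streams[0].advance()
        sort_assuming_increased(streams, 0, buggy)
    return out

# The duplicated key (1) sits at the END of the second stream, as in the report:
streams = [[1, 3, 5], [1], [2], [4]]
print(union_read(streams))              # → [1, 2, 3, 4, 5]
print(union_read(streams, buggy=True))  # → [1, 3, 5]
```

With the buggy pass, after the duplicate stream [1] is exhausted, the early return at position 0 leaves it sitting ahead of the still-valid [2] and [4] streams, so their keys vanish from the read, matching the missing-frame symptom reported in this issue.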
The problem was solved. Best wishes, |
No, thank you. You found the first major bug in the historian since its beta release 3 years ago (at least in the core engine part that I wrote). I've checked in the fix; it should be available in tomorrow's nightly build.
437ed9d
…On Sun, Feb 12, 2017 at 10:49 AM, yuwenpeng wrote:
The problem was solved. Previously, data loss happened around 7 times per hour on average; now there isn't any missing data in the last 15 hours.
Thank you for all your help! @ritchiecarroll @StevenChisholm
Best wishes,
Wenpeng Yu
|
I'll add my thank you as well - nice job @yuwenpeng! I'll close this for now given the discovered fix. Thanks again! |
Hi,
We found that some of the PMU data are not archived to the d2(i) files. Some of the historical data can't be found, either via the API or the 'Historian Data Viewer'. The missing data may last from 1 second to around 10 seconds, and this happens several times every hour.
We then added an action adapter to openHistorian to count how many data points are received in every frame. It shows that all data and all frames are received by the adapter, which means openHistorian received the data, but some of the frames are not archived. We retrieved the data several hours after it was received by openHistorian, but some of it can't be found.
In short, the data has 36,000 frames/hour, and all frames are received by openHistorian, but around 400 frames/hour on average can't be found via the API or 'HistorianView.exe'.
What should we do to fix this issue? Could it be something wrong with the configuration?
We are running openHistorian 2.0.415 on Windows Server 2012 with .NET Framework 4.5, and the openHistorian 2.1 release on Windows 10 with .NET Framework 4.6. Both of them are suffering missing data.
Best wishes,
Wenpeng